What Can We Do About Reward Hacking?: Concrete Problems in AI Safety Part 4

112,447 views

Robert Miles AI Safety

A day ago

Three different approaches that might help to prevent reward hacking.
New Side Channel with no content yet!: / @robertmiles2
Where do we go now?: • Where do we go now?
Previous Video in the series: • Reward Hacking Reloade...
The Concrete Problems in AI Safety Playlist: • Concrete Problems in A...
The Computerphile video: • General AI Won't Want ...
The paper 'Concrete Problems in AI Safety': arxiv.org/pdf/1606.06565.pdf
With thanks to my excellent Patreon supporters:
/ robertskmiles
Steef
Sara Tjäder
Jason Strack
Chad Jones
Stefan Skiles
Katie Byrne
Ziyang Liu
Jordan Medina
Kyle Scott
Jason Hise
David Rasmussen
Heavy Empty
James McCuen
Richárd Nagyfi
Ammar Mousali
Scott Zockoll
Charles Miller
Joshua Richardson
Fabian Consiglio
Jonatan R
Øystein Flygt
Björn Mosten
Michael Greve
robertvanduursen
The Guru Of Vision
Fabrizio Pisani
A Hartvig Nielsen
Volodymyr
David Tjäder
Paul Mason
Ben Scanlon
Julius Brash
Mike Bird
Taylor Winning
Roman Nekhoroshev
Peggy Youell
Konstantin Shabashov
Dodd Almighty
DGJono
Matthias Meger
Scott Stevens
Emilio Alvarez
Michael Ore
Robert Bridges
Dmitri Afanasjev
Brian Sandberg
Einar Ueland
Lo Rez
C3POehne
Stephen Paul
Marcel Ward
Andrew Weir
Pontus Carlsson
Taylor Smith
Ben Archer
Ivan Pochesnev
Scott McCarthy
Kabs Kabs Kabs
Phil
Philip Alexander
Christopher
Tendayi Mawushe
Gabriel Behm
Anne Kohlbrenner
Jake Fish
Jennifer Autumn Latham
Filip
Bjorn Nyblad
Stefan Laurie
Tom O'Connor
Krethys
PiotrekM
Jussi Männistö
Matanya Loewenthal
Wr4thon

Comments: 272
@aenorist2431 6 years ago
I love how you could not get through "always serves the interest of the citizens" with a straight face. Hilariously depressing.
@inyobill 5 years ago
Governments at least have a mandate to work for the benefit of the citizens. Obviously how effectively they carry it out is another matter. Other systems, such as corporations, have no such mandate. Would you prefer empowering a system that has a mandate to enact your goals, or a system that cares nothing for YOUR goals?
@manishy1 4 years ago
@@inyobill Unfortunately, this doesn't take into account the reward-hacking nature of government: it is ideal to create a system that rewards the government without benefiting the citizens. Furthermore, it is beneficial to trick the citizens into thinking they are benefited when, in fact, the government is exclusively benefited. There are only so many tax dollars, and the government benefits most by keeping them all (which manifests as a ruling class). This is why politicians' wages tend to rise disproportionately relative to the general population's. A corporation, by contrast, has no mandate to look after citizens (since it has no citizens, only employees and customers) and will tend to act in the interest of its own goal, which is to be a lucrative business. In many instances it is beneficial to benefit the customer (and some employees) in order to generate more wealth. Of course, when certain entities (in particular, regulatory bodies) demand that other goals have higher weight, that goes out the window, as it becomes more beneficial to prioritize the desires of those regulatory elements. This is only practical when the reward (money) is completely controlled by that regulatory body (the various federal reserves). Look at Google AdSense vs the CCP social credit scheme. Both utilize an inordinate amount of surveillance, assisted by neural networks, to monitor people's behaviour. One produces handy advertisements and convinces me to buy things I already want (even if I don't know it yet); the other will imprison me for the colour of my skin / where I was born. And in both instances, each system works exactly as designed.
@inyobill 4 years ago
@@manishy1 Unfortunately that paradigm doesn't take into account the greater complexity of the system, and the self-awareness of the participants: government officials, employees and citizens. Do you vote? If citizens aren't getting a return on their investment, then they need to elect different representatives. Nor does your statement negate my original comment.
@manishy1 4 years ago
@@inyobill Your argument implies governments have a mandate to work for the benefit of the citizens. I contradicted that. Self-awareness introduces more complexity to the system, certainly, but it doesn't contradict the observed phenomenon: elected officials rarely represent the interests of the general populace; they pay lip service, then proceed to employ an army of bureaucrats to suppress the people. I extrapolated that this behaviour can be described as reward hacking.
@inyobill 4 years ago
@@manishy1 That is the mandate. Governments are more or less successful in carrying it out. You in no way proved that wrong. If your elected officials are not representing your interests, you and your ilk are voting for the wrong people.
@PragmaticAntithesis 5 years ago
2:48 "Your AI can't hack its reward function by exploiting bugs in your code if there are no bugs in your code." Brilliant!
@baranxlr 3 years ago
If I were the one writing AI, I would simply not make any mistakes. So simple
@giannis_m 1 year ago
@@baranxlr I am just built different
@texti_animates 9 months ago
@@baranxlr 2 years late, but OMG, I'm trying it and it's so hard to write an AI
@IAmNumber4000 4 years ago
“Wireheading, where the fact that the reward system is a physical object in the environment means that the agent can get very high reward by physically modifying the reward system itself.” Drug addict robots are our future
@harryellis9571 1 year ago
There's actually a really interesting distinction between the two. Drugs tend to make addicts less capable, so stopping their intake isn't too difficult (imagine if heroin made you smarter and more active; you'd probably get pretty good at ensuring you always have some). This isn't the case for an AGI. A wireheaded AGI isn't just useless at completing its task, it's actively going to ensure you can't prevent it from wireheading itself. E.g. you try to take the bucket off its head and it kills you... maybe they are similar to drug addicts in that sense
@pafnutiytheartist 1 year ago
@@harryellis9571 "Imagine if heroin made you smarter" - that's basically the plot of limitless
@inthefade 4 years ago
I love how difficult this problem is, because my first thought was that there should be a reward for the agent being modified, but then I realized that would instantly subvert the other reward systems, because the AGI would then act to try to get modified. This channel has made me feel that reward systems are completely untenable and useless for an AGI.
@paradoxica424 1 year ago
we spend a quarter of a lifetime navigating arbitrary reward systems set up by large children who also don’t truly care about the reward systems … reward systems are also useless for humans imo, but to a lesser extent
@numbdigger9552 1 year ago
@@paradoxica424 pain and pleasure are reward systems. They are also VERY effective
@keroqwer1761 6 years ago
Pyro and Mei blasting their weapons on the toaster made my day :D
@NoobsDeSroobs 4 years ago
Kero Qwer toast*
@metallsnubben 6 years ago
"I'm gonna start making more videos quickly so I can improve my ability to make videos" Why does this sound a bit familiar... ;)
@Krmpfpks 6 years ago
I was actually laughing out loud at 4:49 , thank you!
@SalahEddineH 6 years ago
Me too :D That was lowkey and perfect :D
@MasterNeiXD 6 years ago
Krmpfpks Even he couldn't hold it in.
@Stedman75 6 years ago
I love how he couldn't say it with a straight face... lol
@SecularMentat 6 years ago
Yuuuup. That was hilarious. I wonder how many takes he had to do to not laugh his ass off.
@spiveeforever7093 6 years ago
At the end he has a blooper, "take 17" XD
@dr-maybe 6 years ago
These videos of yours are great on many levels. The topic is extremely important, the way of explaining is very accessible, the humor is subtle yet brilliant, and the pacing is just perfect.
@NiraExecuto 6 years ago
4:47 "...ensuring that the government as a whole always serves the interests of the citizens. But seriously, I'm not that hopeful about this approach." Gee, I wonder why.....
@circuit10 4 years ago
I mean, we all think negatively about this, but honestly 99% of what governments do is for the good of the people, and it's so much better than 1000 years ago, for example, or than a dictatorship
@Orillion123456 4 years ago
@@circuit10 Well, for the most part, modern governments are not strictly "better" than an absolute government system, just "less extreme". An absolute ruler (feudal, imperial, dictatorial, whatever) can wreak terrible havoc, sure, but can also implement significant positive changes quickly and easily; there are historical examples of both. Modern governments are optimized to make it really slow to change anything and for there to be many points at which a change can get paused or cancelled before being put into place. We are failing to get anything important done any time soon, but hey, at least no first-world nation has gone full Hitler yet, so we have that going for us? In the end the optimal government is one with actual power to quickly implement sweeping changes where necessary (like an absolute ruler of old would have) but with a proper screening process to ensure competence and benevolence. Unfortunately such a thing is impossible to implement (you can't get 2 people, let alone the entire planet, to agree on which criteria make for a competent and/or benevolent ruler, and you can't physically implement reliable tests to ensure any given person meets them). So in a way politics and AI safety are kinda similar in terms of the problems they face.
@debaronAZK 3 years ago
@@circuit10 Where do you get this number from? Your ass? Income inequality has never been greater, and never before have so few people owned so much wealth and power as now, and it's only getting worse.
@harrisonfackrell 3 years ago
That little grin and barely-repressed chuckle when you started talking about the government really got me.
@dannygjk 5 years ago
The political system as it is currently administered (ha ha) selects the candidates that are most fit to win an election, which in general has a deleterious effect on society.
@KuraIthys 5 years ago
Yes. There's also some evidence suggesting that the implementation of laws themselves is biased towards laws chosen by those with the most influence on society, not by the largest part of society (and that is usually synonymous with the wealthiest).
@TheRealPunkachu 4 years ago
Votes have become a target so they are no longer a good measurement :/
@irok1 3 years ago
@@TheRealPunkachu true
@MarkusAldawn 2 years ago
@@TheRealPunkachu I'd clarify to "winning a plurality of votes," since there are definitely strategies which lose you votes but gain you voter _loyalty._ But yeah, the end goal is to align your vote, and help align other people's votes, to support the candidate that will do the most good. There's probably a function you could draw up to describe "I know this person lies in 50% of their promises, but they haven't been elected yet, so I have to evaluate the likelihood of them keeping >50% of their promises," and vary it to your liking (maybe the average promise-keeping of the politicians most similar to them ideologically? But that would fail because a single politician won't be able to enact their campaign promise to, for example, change the national flag without support, so it's limited by the degree to which they *could* keep that promise. Certainly politicians would very quickly start clarifying their promises to "I'll do my best" and so on). Anyway, humans are adversarial training networks for other humans; that's just what society means.
@himselfe 6 years ago
It's nice to see more of what AI safety researchers are considering to try and solve the problems of AI safety. Touching on what you said about software bugs, I think careful engineering should be an absolute cornerstone of AI development. Nature has natural selection and the inherent difficulty of survival to iron out bugs, and it doesn't care if entire species get wiped out in the debug process, we don't have that luxury. AI code should be considered critical code, and subject to the most stringent quality standards, not only to prevent the AI from exploiting bugs in its own code, but also to prevent malicious entities from doing the same to manipulate the AI.
@sperzieb00n 6 years ago
AGI always makes me think about the first and second Mass Effect, and how true AGI is basically illegal in that universe.
@Shrooblord 6 years ago
Ah, the last method you discuss is quite smart. I love the idea of the AI predicting what would happen if it attempted to hack its reward system, and seeing that the "real world state" is different than its "perceived world state", and also less rewarding than actually making the real world state a better environment as defined by its reward function. It almost makes it feel like the AI is programmed with an understanding of consequences, and that consequences matter to its goals.
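That reading matches the model-lookahead approach described in the video: score plans with the current reward function applied to the predicted real-world state. A minimal toy sketch of the idea (all names and numbers below are illustrative, not from the video):

```python
def evaluate_plan(plan, simulate, current_reward):
    """Score a plan by simulating its outcome and applying the *current*
    reward function to the predicted real state - not to whatever the
    reward signal itself would read after the plan runs."""
    predicted_state = simulate(plan)
    return current_reward(predicted_state)

def choose_plan(plans, simulate, current_reward):
    return max(plans, key=lambda p: evaluate_plan(p, simulate, current_reward))

# Toy world: a state is (actual_mess, reward_sensor_reading). Covering the
# camera maxes out the sensor but leaves the real mess in place.
simulate = {"clean up": (0, 0), "cover camera": (5, 99)}.get
current_reward = lambda state: -state[0]   # cares about the real mess, not the sensor
assert choose_plan(["clean up", "cover camera"], simulate, current_reward) == "clean up"
```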
@dtanddtl 6 years ago
This needs to be added to the playlist
@serovea333 6 years ago
Just binge-watched your whole channel. I love how the *phile series has had a ripple effect on all you guys. Keep up the great work!
@SalahEddineH 6 years ago
New Rob Miles video! Yaaaay! I love your work, damnit! Keep rocking! Seriously!
@SalahEddineH 6 years ago
:D 4:40 Was just PERFECT :D
@kurtiswithak 6 years ago
2:48 amazing meme usage, laughed out loud
@qd4192 6 years ago
As a psychologist, it seems to me that an advanced AI system fits perfectly the standard definition of a sociopath. If it were administered Robert Hare's test for sociopathy, it would score the maximum, having no genuine concern for others, but only for maximizing its own rewards. Have you considered this perspective? (Utterly selfish, inherently dangerous.)
@dannygjk 5 years ago
I.e., unless altruism naturally emerges from an advanced AI's interactions with its environment and the subsequent consequences, coupled with its 'mental' development, it will be at best completely neutral toward humans.
@NathanTAK 6 years ago
NEW ROB MILES VIDEO. THIS IS THE GREATEST DAY OF MY LIFE.
@zxb995511 6 years ago
Until the next one
@NathanTAK 6 years ago
+zxb995511 No no no, the delay gave me terminal cancer and I only have 30 seconds left to live.
@brbrmensch 6 years ago
ideas and hair start to get really interesting
@jayxi5021 6 years ago
The thumbnail though... 😂😂
@chrisaldridge545 6 years ago
Hi Robert, I've watched most of your videos now and just want to say many thanks. You are a really great communicator, and I rate your input as one of the top 2 YT sources I've discovered so far. You asked in another video for comments on what direction future videos should take. I think your idea of reviewing current papers and explaining them to curious laymen like myself would be best for me. It's exactly what my other favourite channel, "Two Minute Papers", does, with more of a focus on CGI, fluid simulations etc. I wish the videos were longer too, say 20-40 mins each. (If I could afford to help patronise you guys I would, but I'm just an old/obsolete dumb ex-programmer, green with envy at the progress possible using the ML and AI tools of today.)
@TheMusicfreak8888 6 years ago
Fantastic video once again. Seriously you never disappoint.
@magellanicraincloud 6 years ago
Brilliant videos, I always feel more educated after listening to what you have to say. Thanks Rob!
@DaveGamesVT 6 years ago
These are always fascinating. Thanks.
@nilstrieb 4 years ago
5:22 OMG, an Overwatch reference! Love you!
@departchure 6 years ago
Thanks for your videos Rob. I'd love for you to address something that has been confusing me about AI safety. For an AGI that is improving itself, its own code becomes part of its environment. It should have every incentive to reward hack its code. Even if you try to sandbox it, if it's smart enough, it's a high reward target to go after the code dictating its reward function. The same should be true of a utility function (because it seems like it would look the same). Modifying its utility function would allow it to achieve the highest possible value of its utility function. And even if you managed to keep it away from its utility or reward function, it would still like to wirehead itself elsewhere in its code. A stamp doesn't actually have to be collected anymore, for example, to be registered as collected, a billion times. How is it even possible to motivate a truly smart AGI to do anything? If it's really smart, it seems like it has to be modifying its code, and it'll be smart enough to realize that the easiest/fastest/most perfect way to meet whatever its goal is would be to cheat inside its software. Perfect score, every time.
@departchure 6 years ago
Maybe that is a safety net. You try to get your AGI to solve your problem to your satisfaction before it figures out how to wirehead itself, knowing that it will ultimately wirehead itself before it destroys the universe looking for stamps.
@JesseCaul 6 years ago
I loved your videos on Computerphile a long time ago. I wish I had found your channel sooner.
@Ebdawinner 6 years ago
Keep up the great work, brother. You bring insight that is worth more than gold.
@interstice8638 6 years ago
Great video Rob, your videos have inspired me to pursue an education in AI and AI safety.
@simonmerkelbach9350 6 years ago
Really interesting content, and the production quality of your videos has also gotten top-notch!
@General12th 6 years ago
This is such a good channel! I love it!
@AlexMcshred6505plus 6 years ago
The Pyro/Mei toast was hilarious, wonderful as always
@SwordFreakPower 6 years ago
Top thumbnail!
@papa515 6 years ago
The concepts discussed in these videos can be rated by comparing the behaviors we want an AGI to have with how we analyze our own (human) behaviors. In this video I found an especially strong connection. This means that this way of looking at AGI is not just important for AI safety but also for the creation of AGI in general. The only model we have for general intelligence is between our collective ears. So to have a chance at constructing an AGI we will need a very deep and complete understanding of the engine between our ears. As we come to understand our own mentation on deeper and deeper levels we will not just learn how to go about constructing an AGI; much more importantly, we will understand ourselves. And this new understanding will help us learn how to behave as a modern social species and maximize our chances of persisting with ever more advanced and complex technologies.
@electro_fisher 6 years ago
A+ editing, great video
@thrillscience 6 years ago
Shana Tova! Have a great new year. Your videos are great.
@sunejohansson 6 years ago
Would love to see more about the code golf / how the program works or something like that :-) Keep up the good work. Cheers from Denmark
@splitzerjoke 6 years ago
Great video, Rob :)
@kiri101 6 years ago
Thank you for the content!
@user-wi3db6wu8d 3 years ago
Your videos are really great!
@douglasoak7964 6 years ago
It would be really interesting to see coded examples of these concepts.
@sakurahertz 4 years ago
Ok I'm 3 years late to this video (and channel) but I love that Mei from Overwatch reference Also amazing content, I always find this kind of stuff fascinating
@Pheonix1328 4 years ago
Agents "fighting" each other and keeping each other in check reminded me a bit of the Magi from Evangelion.
@AmbionicsUK 6 years ago
Yey been waiting for this. Watching now...
@mykel723 6 years ago
Good idea, more people should post their links in the dooblydoo
@simeondermaats 4 years ago
Artifexian's been doing it for a couple of years
@tetraedri_1834 6 years ago
What if AGI realizes its reward function is being modified, and also realizes that the new reward function would for some reason give it higher reward once the new reward function is applied? Maybe it won't allow people to change its reward function until it ensures the new system would give it higher reward...? The rabbithole never ends...
@David_Last_Name 4 years ago
Someone else in the comments had this exact same idea, but then pointed out that it would encourage the AGI to never do what you wanted, in order to force you to give it a new reward function. You can't ever win!!
@tomahzo 2 years ago
4:49 : Hard to say that with a straight face, eh? Yeah, I feel you ;D. 5:23 : That drawing is such a delight ;D.
@jupiter4602 4 years ago
I know this video is over two years old now, but I couldn't help but notice the problem with model lookahead being somewhat similar to the problem with adversarial reward systems: If the general AI is able to subvert the model lookahead penalty, which in some cases could potentially happen by complete accident, then we're left with an AI that can plan what it wants without penalty again.
@CybranM 6 years ago
These videos are so interesting.
@bballs91 6 years ago
Glad I'm not the only one who noticed the Overwatch characters 😂😂 Well done, Rob
@gr00veh0lmes 5 years ago
You ask some damn smart questions.
@BlackholeYT11 4 years ago
"Pre-arachnophage" - as another former student I almost died when you brought that up, I was there in the room at the time xD
@David_Last_Name 4 years ago
Eh........ok I give up, explain please? This sounds both interesting and terrifying.
@ShazyShaze 6 years ago
That's a darn great thumbnail
@grugnotice7746 6 years ago
Id, ego, and superego as adversarial agents. Very interesting.
@XxThunderflamexX 3 years ago
Could you combine the agent and "utility function defender" into the same agent, and produce something that is "afraid" to wirehead itself? Something that periodically predicts what worldstates it would expect to observe if it suddenly operated with the goal of tricking its own systems, and then adds the predicted worldstates to a blacklist to check against with locality hashing. Admittedly, the hard part is probably in defining "tricking its own systems", which might be the core of the problem itself: how do we write a utility function that we unambiguously want to be maximized, even all else being equal?
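For what it's worth, a rough sketch of the locality-hashing blacklist described above, using random-hyperplane LSH. The state featurization and the single hash table are simplifying assumptions; a real system would use several tables so near misses still match:

```python
import numpy as np

rng = np.random.default_rng(0)
N_BITS, DIM = 32, 128                    # hash width, state-feature dimension
PLANES = rng.normal(size=(N_BITS, DIM))  # shared random hyperplanes

def lsh_code(state_vec):
    """Sign pattern against random hyperplanes: nearby states share most bits."""
    return ((PLANES @ state_vec) > 0).tobytes()

blacklist = set()

def blacklist_state(state_vec):
    """Record a predicted 'I tricked my own systems' worldstate."""
    blacklist.add(lsh_code(state_vec))

def looks_like_wireheading(state_vec):
    return lsh_code(state_vec) in blacklist
```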
@DoveArrow 2 years ago
Your comment about the flamethrower and the nitrogen gun trying to make toast perfectly describes our democratic systems. Maybe that's why Churchill said "democracy is the worst form of government, except for all the others."
@amargasaurus5337 4 years ago
Imagine making an AGI with "keeping peace amongst humans" as its goal, and ten years later coming back to find out it nerve-stapled the entire human population so that no one felt the need to fight for anything
@roger_isaksson 1 year ago
The problem with rewarding hallucinated negative utility (an anxious AI) is that it causes inaction, from the observation that most actions got bad results (there's a lot of negative utility that can be fantasized). Black vs. white lists: the blacklist usually "wins", whereas whitelists require effort to craft, since they are mutually exclusive and might "sieve" a bit harshly. There's also the problem of no good outcome regardless of future projection. Then one is forced to "embrace the suck" of avoiding a (much) worse disaster. Some actions by the AGI will thus manifest as "apocalyptic", even though the AGI might strive to avoid 'The Apocalypse'. I love this stuff. 🤭👍
@SS2Dante 6 years ago
("Why not just"-style question, for anyone who wants to jump in and show me the stuff I've missed/overlooked/been dumb about.)
A lot of the trouble seems to come from the AI's ability to affect the physical world. Are there inherent problems to designing an AI such that the physical nature of the AI and the utility function sync up to prevent this from happening? Note: this is slightly different from sandboxing, where the utility function still has free rein to attempt "escape" through social engineering etc. I'm imagining a computer which functions as an oracle (ask a question, get an answer) with an input of... everything we have (functionally, the internet), a keyboard, and a screen for output. The utility function would look something like:
1) Default state (i.e. no questions waiting to be answered) - MAX score
2) Any action other than those specified in parts (4) to (5) - MIN score
3) A question is fed in - 0 score
4) Light up pixels on screen to answer the question to the best of its ability (with current information) - 10 score
5) Light up pixels on screen to explain why it is unable to find an answer - 8 score
As far as I can see, point (2) completely neuters the AI's desire to do... well, anything, except answer the questions that are given. In the default state it has max reward, but can't set up any kind of existence protection, as that violates (2), so it would just... sit there. Once a question is fed in (which it can't prevent, for the same reason), it is incentivised to answer the question (an improvement in score) as a short-term goal, which allows it to clear the question and jump back up to the (1) state, where it idles again. The biggest danger I can see is in how we specify parts (4) to (5), but if worded clearly, any attempt at social engineering etc. would fall outside the remit of "lighting up pixels to answer the question if you are able to do so". Obviously such an AI would be far slower and less effective than one that can actually take actions, but it's certainly better than nothing! Anyway, as I said, I'm sure I've missed... quite a few somethings, so if you know what's up please come correct this! Oh, and great video as always Rob, really enjoying the channel!
@thesteaksaignant 5 years ago
I know it's been a year, but I think you left out the tricky part: how to get meaningful / "good" answers. You need to evaluate the quality of the possible answers and choose the best one according to some criteria (that is, a utility function). Reward hacking of this utility function will be a good strategy for choosing the answer with the maximum value. All the risks described in this video apply. For instance, if the reward for the answer is given by humans, then human manipulation is a good strategy: give answers that will please humans / correspond to what they think is a good answer (regardless of what the true answer is).
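A toy rendering of the proposed scoring rules makes this reply's point visible: rule (4)'s "best of its ability" quietly requires a second utility function that ranks answers, and that is where the original problems reappear. (The action encoding below is hypothetical.)

```python
MAX_SCORE, MIN_SCORE = 100, -100

def score(action, pending_question):
    if action is None:
        return 0 if pending_question else MAX_SCORE   # rules (3) and (1)
    if action == "display_answer":                    # rule (4) - but which answer is
        return 10                                     # "best"? That needs its own utility
    if action == "display_cant_answer":
        return 8                                      # rule (5)
    return MIN_SCORE                                  # rule (2): any other action
```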
@MarcoServetto 6 years ago
For the gibbon/panda, a simple way to make the system more resistant could be to generate 10 random filters like that and pre-apply them to the original image. Then evaluate all 10 images and see if there is some "common" result. Indeed, our eyes are full of noise all of the time.
@Frumpbeard 1 year ago
This is called data augmentation, and it's already done all the time.
@MarcoServetto 1 year ago
@@Frumpbeard And how would the adversarial noise survive as an attack after this 'data augmentation'?
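A sketch of the majority-vote idea from this thread - test-time augmentation rather than training-time (`classify` stands in for any image classifier, and labels are assumed hashable). On the question above: a perturbation tuned to one exact pixel pattern usually doesn't survive fresh noise, but an attacker who knows about the vote can optimize the perturbation in expectation over the noise distribution, so this raises the cost of an attack rather than closing the hole.

```python
import numpy as np
from collections import Counter

def robust_classify(image, classify, n_views=10, sigma=0.05, rng=None):
    """Majority vote over randomly perturbed copies of the input image."""
    rng = rng or np.random.default_rng()
    votes = Counter(
        classify(np.clip(image + rng.normal(0.0, sigma, image.shape), 0.0, 1.0))
        for _ in range(n_views)
    )
    return votes.most_common(1)[0][0]
```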
@BatteryExhausted 6 years ago
Also, I was trying to explain to a friend that top thinkers discard Asimov's laws. Please could you make a video directly dealing with Asimov? Thanks. Love your work.
@MrSparker95 6 years ago
He already did a video about that on Computerphile channel: kzbin.info/www/bejne/bYGuqWahiJyZaqM
@clayfare9733 2 years ago
I know I'm late to the party here, but while thinking about the solutions to reward hacking that you listed here, I can't help but wonder if it would be possible to set something up like: "You get 100 reward if you collect 100 stamps, but you lose 1 point for every stamp you collect over 100. If you attempt to modify your code you lose all/infinite points."
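The proposal above as code - a sketch, not a recommendation (the shape of the reward below 100 stamps is an assumption, since the comment doesn't specify it). The arithmetic is the easy part; all the difficulty hides in the `modified_own_code` flag, which presumes we can reliably detect self-modification:

```python
def stamp_reward(stamps, modified_own_code):
    if modified_own_code:
        return float("-inf")         # "lose all/infinite points"
    if stamps <= 100:
        return stamps                # assumed: linear up to the 100-stamp target
    return 100 - (stamps - 100)      # lose 1 point per stamp over 100
```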
@Jo_Wick 4 years ago
That analogy at 5:22 has a critical flaw: the liquid nitrogen would evaporate and smother the flamethrower's flames every time, by displacing the oxygen in the air.
@bassett_green 6 years ago
Dank memes and AGI, what's not to love
@LuminaryAluminum 6 years ago
Love the Pyro and Mei reference.
@DisKorruptd 4 years ago
Regarding the bucket bot, I was actually just thinking that, in order to prevent the dolphin problem, the reward could scale with the size of the things it picks up, so it is rewarded more for larger bits of trash. Getting -100 for each piece of trash it sees would demotivate it from turning one piece of trash into two, and it'd rather collect 1 piece of trash worth 500 than 2 pieces worth 200.
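The "bigger pieces are worth more" intuition can be made precise by choosing a reward that is superadditive in item size, so splitting never pays. A sketch (the quadratic is just one convenient choice):

```python
def trash_reward(item_volume):
    # r(a + b) > r(a) + r(b) for any positive a, b, so tearing one piece of
    # trash into two strictly reduces the total reward available from it.
    return item_volume ** 2

assert trash_reward(5) > trash_reward(2) + trash_reward(3)   # 25 > 4 + 9
```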
@WilliamDye-willdye 6 years ago
I strongly disagree with the claim that internal conflict between powerful systems is a dubious approach (5:05), but the difference between Mr. Miles and me may be just a matter of semantics. I doubt he truly advocates a system of government in which we give total power to the prime minister and then simply tell the public to make sure they elect a very good prime minister. He later talks about components within the AI that lobby for different conclusions (at 7:16, for example), so maybe we only differ in how we draw the boundaries around what constitutes a separate entity in the conflict. For background, my own approach to AI safety (tentatively entitled "distributed algorithmic delineation") treats division of power as a critical component. Moreover, I fear that unification of power is to a large extent a natural long-term effect in any social organization with a high degree of self-interaction. Therefore a primary design consideration of a good safety system needs to place a high priority on defeating this natural tendency to centralize (sorry, "centralise") power. Well, like I said, maybe the differences between us are semantic. I still find these videos very interesting and often informative, and I'm glad that Mr. Miles is promoting AI safety as a proper field of study. For too many years, almost all of us with an interest in the topic could only make it an off-duty hobby. It's a delight to see well-written videos and papers about a topic that has interested me for so long.
@Paul-rs4gd 5 years ago
So we wouldn't want to give all the power to the prime minister, but it might get a bit better if the power is divided among 3 bodies which watch over each other. The logical conclusion seems to be to increase the number of entities. In human society it is said that absolute power corrupts. When you have a lot of humans (or other types of agent) interacting, things seem to be fairer when there are more of them, provided that no individuals gain drastically more power than others. It is the principle of avoiding monopolies of power. Hierarchies do form, so the system certainly has its problems, but nobody has ever managed to rule the whole earth so far!
@XxThunderflamexX 3 years ago
Would multi-agent systems be resistant to wireheading? If the multiple agents are not normally in conflict with each other - say, they were able to delegate tasks based on specialization - there wouldn't be as much of a risk of humans being caught in the crossfire. It would only be when one agent starts misbehaving, wireheading itself or trying to take disproportionate power for itself over the other agents, that the other agents would be incentivised to step in and realign the agent with their shared goals.
@claytonharting9899 3 years ago
An AGI that actively wants its reward function changed could make a very good character for a story. I imagine it as a schizophrenic AI that settles into a strategy where it randomly changes its own reward function. Maybe one day it will drop a bomb on a city, then the next day come by and go "oh no, what happened? Here, let me help you"
@starcubey 6 years ago
2:46 The best part of the video right there
@chrisdaley2852 6 years ago
What utility function allows the adversarial agent to recognise reward hacking? Wouldn't it just minimise the reward?
@syzygy6 4 years ago
You give the three-branch-government analogy as an example of overkill, and I agree with your reason, but I also think you might say that the programmer is already a general-purpose intelligence performing judiciary functions, and there may be value in formalizing the role human agents play in training AI, rather than thinking of human agents as existing entirely outside the system. For that matter, I wonder how much we can learn about effective governance from training AI; if you think of corporations as artificial intelligences, then you encounter the same problems with reward hacking.
@4.0.4 6 years ago
The "careful engineering" idea is so good, maybe we should apply it to every piece of mission-critical software! Oh... wait.
@williamchamberlain2263 4 years ago
1:10 I'd heard a few schools in Qld used to dissuade kids from taking some uni-entrance Year 11-12 subjects, to improve the schools' positions in league tables.
@donkriegnaszojcze 6 years ago
4:40 So, like the Magi system from NGE?
@cheydinal5401 4 years ago
I actually really like the idea of opposing subsystems, in the "branches of government" style. Sure, multi-branch government isn't perfect, but it's more stable than single-branch government. Not doing multi-branch AI would mean single-branch AI, which can easily become dangerous. The opposing AI is basically an AI that is trained to make sure there is sufficient AI safety. Then divide the other into a "legislative" AI that only decides what to do, and an "executive" AI that actually implements it in the real world, where those actions can only be taken if the opposing "judicial" AI approves them
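A minimal sketch of that three-branch split, with all three components as hypothetical placeholders. The weak point, as the video argues, is that the "judicial" model must itself be robust, and its failures must not be correlated with the planner's:

```python
def run_step(world, planner, safety_critic, executor):
    proposal = planner.propose(world)                 # "legislative": choose a plan
    if not safety_critic.approves(world, proposal):   # "judicial": veto unsafe plans
        return None                                   # default to doing nothing
    return executor.execute(proposal)                 # "executive": act in the world
```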
@desiraepelletier719 5 years ago
What are your thoughts on the blockchain project by Dr Ben Goertzel, SingularityNET's AGI coin?
@owlman145 6 years ago
Humans reward hack all the time, but are kept in check by other humans. In this case, society as a whole is the reward function that prevents us from doing really bad things. And yes, it's not a perfect system, so we probably shouldn't make super AIs use it, though it might end up being the case anyway.
@DeusExRequiem 5 years ago
The best method would be an AGI ecosystem where there are many versions of an AGI trying to achieve similar goals, all in competition with other AGIs that might not have the same goals, might be in opposition for different reasons, or might do things unrelated to those goals. All these thought experiments assume a world where there's only one AGI, and once it becomes a problem there's nothing around to challenge it.
@StardustAnlia 5 years ago
Are drugs human reward hacking?
@SamuelDurkin 5 years ago
If the end goal is more fuzzy, and it's not something specific like stamps... would it work if the end goal was "do what the humans like", but without telling it what they like, so it has to try to guess what they like? This end goal sort of means you can keep changing its end goals, as your liking of things is its end goal. When you stop liking something, it has to change what it's doing, which would include turning itself off, if that is something you would like it to do.
@dannygjk 5 years ago
For that to work, humans would have to change what they like, which is unlikely to happen unless some force causes them to.
@snowballeffect7812 6 years ago
I love when my YouTubers reference each other. Makes the walls of my echo chamber stronger. SethBling's vids on his SMW work are amazing.
@Thundermikeee 4 years ago
Wouldn't multiple superintelligent agents that are supposed to keep one another in check have a risk of cooperating as well? Or any adversarial reward agent and the primary agent?
@grimjowjaggerjak 2 years ago
No, since they don't have the same utility function.
@ChrisBigBad 4 years ago
Now I have to go and play a round of Universal Paperclips!
@NoOne-fe3gc 5 years ago
We can see the same thing happening in the game industry currently. Because some studios use Metacritic score as a gauge of success, they orient their games to please the critics and get a higher score, instead of making a good game for the fans
@recklessroges 6 years ago
It feels like AGI research is re-covering much of the ground done by educational psychologists. I look forward to the flow of ideas being reversed and possibly implemented by AGI.
@VorganBlackheart 6 years ago
When you click on a random AI video and then realize it's Robert's new upload ^.^
@RipleySawzen 3 years ago
4:47 How many takes was that to keep an almost straight face?
@RonaldSL- 6 years ago
Yay
@boldCactuslad 6 years ago
Pyro and Mei vs Toast, such a nice sketch
@Davesoft 4 years ago
The 'superior reward function' made me think of Evangelion. They had 3 AIs, hosted on human brains cuz why not, that argue and veto each other into either pre-existing procedures or complete inaction. Ignoring the sci-fi elements, it seems a nice idea, but I'm sure they'd find a way to nudge and wink at each other and unify one day :P
@hugoehhh 6 years ago
Dude. Are we programmed? Anxiety and our actions make so much sense from an AGI programmer's point of view
@rmcq1999 1 year ago
Digital dopaminergic projections?
@martinsmouter9321 4 years ago
4:24 It doesn't have to be smarter, just good enough to be less rewarding than its environment.
@notoriouswhitemoth 4 years ago
Humans have multiple reward functions. We have dopamine, that rewards setting goals and accomplishing those goals, particularly related to our needs; serotonin, that rewards doing a thing well, impressing other people with displays of proficiency; and oxytocin, that rewards kindness and cooperation.
@armorsmith43 3 years ago
> We have dopamine
Some of us do... :_(
@JimGiant 5 years ago
Hmm, what if the reward is split into two parts: the purpose (e.g. get a high score in SMB), and doing it in a way which doesn't displease the owner. Have the owner-is-happy part be worth more than the maximum possible score for the purpose; this way any discovered attempt at reward hacking will result in a lower score than any attempt to do it correctly. With more powerful AI, which has the power to threaten or coerce the owner, have a third reward layer bigger than the other two combined, which rewards the AI as long as it doesn't interfere with the owner's free will.
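A sketch of that layering, assuming the task score has a known cap: each layer's bonus exceeds everything below it combined, so no amount of task score can justify displeasing (or coercing) the owner. Of course, actually measuring `owner_happy` and `owner_uncoerced` is the unsolved part:

```python
TASK_CAP = 1_000_000                     # assumed maximum possible task score
APPROVAL = 2 * TASK_CAP                  # layer 2 outweighs all of layer 1
FREE_WILL = 2 * (TASK_CAP + APPROVAL)    # layer 3 outweighs layers 1 and 2 combined

def total_reward(task_score, owner_happy, owner_uncoerced):
    reward = min(task_score, TASK_CAP)
    if owner_happy:
        reward += APPROVAL
    if owner_uncoerced:
        reward += FREE_WILL
    return reward
```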
@Paul-rs4gd 5 years ago
I am interested in the idea that reward hacking is essentially the opposite argument to "take a pill so you will be happy after killing your kids" - TPHKK for short. One view argues that an AI will try to modify its reward function, while the other argues that the AI will resist attempts to modify the function. I would like to hear more discussion of this conflict. My own 2 cents: it seems to me that simple RL algorithms 'feel' their way along a gradient resulting from actions THEY HAVE ACTUALLY TAKEN in the real world. If the sequence of actions required to reward hack was sufficiently long, it is vanishingly unlikely that that sequence would be executed by chance, since no gradient leads the RL along that path (there is no reward until the hack has been successfully executed). The situation is entirely different in an AI that models the world and plans using that model. Such an AI could understand that it is implemented on a computer and plan to modify itself. However, it should see that the plan results in a change to its reward function and does not achieve its current goals. Therefore, like humans being resistant to TPHKK, the AI should not wish to execute the plan.
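The "no gradient leads along that path" point has a quick back-of-envelope version: with A available actions per step, the chance that undirected exploration stumbles on one specific L-step hack sequence is (1/A)^L. The numbers below are illustrative only:

```python
A, L = 10, 20
p = (1 / A) ** L
print(p)   # 1e-20: effectively never found by chance, only by model-based planning
```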
@tomyao7884 6 years ago
But for the last example, won't the AI just try to modify itself to trick itself into believing that the messes will magically disappear and not exist if it doesn't see them? (Like in the Simpsons clip.)
@progamming242 4 years ago
The chuckle at 4:47