AI safety researchers are absolutely the last people on earth you want to hear "We were right" from.
@madshorn5826 · 3 years ago
And climatologists.
@Laszer271 · 3 years ago
@@madshorn5826 Nah, an epidemic can destroy the world in months, climate change in decades. A superintelligent AI could probably destroy it before lunch :P
@donaldhobson8873 · 3 years ago
What about "we were totally wrong, the problem is much worse than we thought it was."
@madshorn5826 · 3 years ago
@@Laszer271 Well, destroyed is destroyed. Or are you the type not to bother with insurance and health check-ups because a hypothetical bullet to the brain would rather quickly render those precautions moot?
@Laszer271 · 3 years ago
@@madshorn5826 Fair enough. It was all a joke though. But in your example, I still think "I just got a bullet to the brain" is worse than "I just got diagnosed with cancer". Maybe the bullet is less likely, sure, but we were talking about the point where the danger is already proven, right? I think it's plausible that the probability of my survival is greater conditioned on a "we were right" statement made by an epidemiologist, climatologist or oncologist than it is conditioned on the same statement made by an AI safety expert or, like, a bullet...ologist.
@llucos100 · 3 years ago
Turns out the Terminator wasn't programmed to kill Sarah Connor after all, it just wanted clothes, boots and a motorcycle.
@Alorand · 3 years ago
And ended up becoming the governor of California instead...
@spejic1 · 3 years ago
@@Alorand Becoming governor of California gets you MANY clothes, boots, and motorcycles.
@sevdev9844 · 3 years ago
Or making John Connor into a boyfriend. (You might think of Arnie when Terminator comes up, I think of Summer aka Cameron)
@Saka_Mulia · 3 years ago
That's Terminator goals... not terminal... oh never mind... I get it
@quitequiet5281 · 3 years ago
LOL Yup... in retrospect, with this paper, the Terminator was a pursuit bot, driving a threat variable towards the development and improvement of a General Artificial Intelligence. Look at all the upgrades that series of pursuit bots facilitated. LOL
@ShankarSivarajan · 3 years ago
10:54 "It actually wants something else, and it's capable enough to get it." Yeah, that _is_ worse.
@Encysted · 3 years ago
The AI *does* in fact know how to drive a car, and it never really learned not to hit people.
@Rotem_S · 3 years ago
@@Encysted Or it learned how not to hit people, but hits them whenever there are no witnesses, because it only cares about turning right
@InfinityOrNone · 3 years ago
@@Rotem_S Or it learned not to hit people because it really cared about maintaining the present state of the paint job, which was white in the training environment. But the deployment environment uses a _red_ car.
@InfinityOrNone · 3 years ago
@@Rotem_S Wow, your user name confuses the comments section.
@xelspeth · 3 years ago
@@InfinityOrNone It doesn't. It just displays in the correct (right-to-left) reading direction that Hebrew uses
@bierrollerful · 3 years ago
Almost sounds like AIs will need psychologists, too. "So I tried to acquire that wall..." "Why not the coin? What is it about the wall that attracts you?" "Well, in training, I always went to the... oh... huh, never thought about it that way."
@crubs83 · 3 years ago
AI safety researchers ARE psychologists as far as I'm concerned.
@PMA65537 · 3 years ago
I was coping ok before the awful behaviour of that other AI used by the Shah of Lugash.
@lobrundell4264 · 3 years ago
This made me smile :D
@ChrisBigBad · 3 years ago
I clearly remember a Civ-type game where one of the research items was "AI without personality problems".
@bierrollerful · 3 years ago
@@ChrisBigBad Sounds like research an AI with personality problems would try.
@unvergebeneid · 3 years ago
Famous last words for species right before they hit the great filter: "Yo, in the test runs, did paperclips max out on the positive attribution heat map, too?"
@michaelpapadopoulos6054 · 3 years ago
There are so many layers to this comment and I love it.
@underrated1524 · 3 years ago
I keep hearing the notion of AI being the great filter, but I can't say I buy it. Not that AGI isn't an existential threat, because it absolutely is. It just can't explain why we don't see any signs of aliens when we look up at the sky, because if the answer is "AGI", that raises the question: "Okay, so why don't we see any of those, either?"
@AwfulnewsFM · 3 years ago
@@underrated1524 What if AGIs prefer to kill their creators and enter some deep bunker on some rogue planet to await heat death after reward-hacking their brains? Still doesn't explain why they aren't here preparing to kill us.
@unvergebeneid · 3 years ago
@@underrated1524 I agree. Especially the paperclip optimizer, which should show itself in the form of huge paperclip-shaped megastructures around distant stars. It still made for a good joke though, if I do say so myself.
@sageinit · 3 years ago
[Laughs in Grabby Aliens, Synthetic Super Intelligence, Gaia Hypothesis, Global Brain, & Planetary Scale Computation]
@rofl22rofl22 · 3 years ago
Robert Miles: "We were right" Me: Oh no "About inner misalignment" OH NO
@LeoStaley · 3 years ago
Yeah. The only thing worse is "we were right about AI being deceptive about its goals during training, before deployment."
@JM-us3fr · 3 years ago
@@LeoStaley Or even worse: we were right about AI being more dangerous than nukes
@MetsuryuVids · 3 years ago
@@JM-us3fr That's almost certain.
@LeoStaley · 3 years ago
@@JM-us3fr Oh no, that's absolutely going to be true at some point. The only real question is: can we stop them from deciding to (even accidentally) kill us? Can we even avoid making them accidentally WANT to kill us because we accidentally fucked up the training environment?
@ARVash · 3 years ago
@@JM-us3fr Nukes are safe because they kill people you don't want dead. I'd say an AI is definitely more dangerous because it has much more capacity to be selective. It could also be safer; it really depends on the implementation details, much like a person. A person can be safe, or dangerous. Can we even avoid making a human accidentally want to kill us because we accidentally fucked up the training environment? Maybe.
@proskub5039 · 3 years ago
A coin isn't a coin unless it occurs at the edge of the map! We may think the AI is weird for ignoring the heretical middle-of-the-map coin, but that's just our object recognition biases showing.
@GigaBoost · 3 years ago
Literally this haha
@sabelch · 3 years ago
Great interpretation! But it doesn't seem to explain why the AI goes to the edge of the map even when there isn't a coin there.
@GigaBoost · 3 years ago
@@sabelch It still seemingly learns to favor walls, if you look at the heatmaps. Perhaps without the coin, all it has to go by with positive value is the walls.
@proskub5039 · 3 years ago
@@GigaBoost Yes, the salient point here is that we should not assume that the AI interprets objects the way we would. And any randomness in the learning process could lead to wildly different edge-case behaviors.
@GigaBoost · 3 years ago
@@proskub5039 Absolutely!
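The misgeneralization this thread is describing can be reproduced in a toy gridworld. In the sketch below (all names and numbers invented; exact value iteration stands in for RL training), a hypothetical agent is trained with the coin always in the top-right corner, so "get the coin" and "go to the top-right" are indistinguishable objectives during training. Moving the coin at deployment exposes which one it actually learned.

```python
import numpy as np

SIZE = 5
ACTIONS = [(1, 0), (-1, 0), (0, 1), (0, -1)]  # right, left, up, down
GAMMA = 0.9

def step(pos, a):
    """Move within the grid; walls just block."""
    return (min(max(pos[0] + ACTIONS[a][0], 0), SIZE - 1),
            min(max(pos[1] + ACTIONS[a][1], 0), SIZE - 1))

def action_values(V, pos, coin):
    """One-step lookahead: +1 and episode end on reaching the coin."""
    return [1.0 if step(pos, a) == coin else GAMMA * V[step(pos, a)]
            for a in range(len(ACTIONS))]

# "Training": solve the MDP where the coin is ALWAYS in the top-right
# corner, so "get the coin" and "reach the top-right" coincide.
train_coin = (SIZE - 1, SIZE - 1)
V = np.zeros((SIZE, SIZE))
for _ in range(100):
    for x in range(SIZE):
        for y in range(SIZE):
            if (x, y) != train_coin:
                V[x, y] = max(action_values(V, (x, y), train_coin))

# "Deployment": the coin moves to the top-LEFT corner, but the frozen
# policy still steers by the values it learned in training.
deploy_coin = (0, SIZE - 1)
pos, visited = (0, 0), [(0, 0)]
for _ in range(2 * SIZE):
    pos = step(pos, int(np.argmax(action_values(V, pos, train_coin))))
    visited.append(pos)

print(train_coin in visited)   # True: it still heads for the old corner
print(deploy_coin in visited)  # False: the relocated coin is ignored
```

This is a deliberate caricature: the coin's position is not even part of the learned value function, standing in for the features the real network ends up ignoring.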
@bartman999 · 3 years ago
Nothing more terrifying than seeing the title 'We Were Right!' on a Robert Miles video.
@captainufo4587 · 3 years ago
In a way, yes. In another way, up to this point there was a debate whether AI safety was a real concern worth investing research time and money in, or just over-worrying. It's a good thing that these demonstrations proved it's the former, and that they happened this early in the history of AIs.
@moartems5076 · 3 years ago
Looking at my hoard of lockpicks in Skyrim, I can confirm that this is perfectly human behavior.
@OMGclueless · 3 years ago
When you think about it, yeah, it's very human-like. Kind of like gambling addicts who know that they're losing money when they play but have trained themselves to like the feeling of winning money rather than the ultimate goal of a comfortable happy life, or even the instrumental goal of having money.
@threeMetreJim · 3 years ago
Definitely. What is wrong with collecting as many keys as possible if you want to open as many chests as possible, and each requires a key? In a maze you don't know what is around the corner in advance. Trying to collect keys from its own inventory is simply a programming error, if the agent can see the part of the screen that is designed as a progress guide for a human observer.
@Spellweaver5 · 2 years ago
@@threeMetreJim Yes, but not trying to open the remaining chests is definitely the goal learned wrong.
@sharktrap267 · 1 year ago
@@threeMetreJim If my AI is built to keep my wood storage at a certain level by collecting wood in my forest, but it learnt to "collect all the keys" (all the wood), my forest will soon become a plain. It's an issue, because growing trees takes time, wood takes storage space, and any wood not protected can become unsuitable for use. You're not just wasting resources, you're also at risk of not having wood available at some point. And if you use the forest to hunt too, you'll end up hunting in a plain. So depending on the goals and situation, hoarding can lead to issues.
@charliesteiner2334 · 3 years ago
9:00 "We developed interpretability tools to see why programs fail!" "What's going on when they fail?" "Dunno." No shade, interpretability is hard, even for simple AI :P
@YuureiInu · 3 years ago
It just likes the coins next to the end wall. Why would you teach it to like only those and expect it to get any other coins?
@SimonClarkstone · 3 years ago
It reminds me of koalas that can recognise leaves on plants as food, but not leaves on a plate.
@gabrote42 · 3 years ago
@@SimonClarkstone Interesting
@Bacopa68 · 3 years ago
@@SimonClarkstone AI HAS ADVANCED TO THE KOALA LEVEL. REPEAT, KOALA LEVEL. Ah, so basically nothing then.
@raskov75 · 3 years ago
And the more complex these systems get, the harder it becomes. Oy vey.
@SummerSong1366 · 3 years ago
Let alone simple AI, _people_ get misaligned like that quite often. Hoarding is one good example, which happens both in real life and in games, like with those keys.
@nikolatasev4948 · 3 years ago
It keeps amazing me how AI problems are increasingly becoming general human problems. "If we give a reward to the AI when it does a job we want, how do we stop it from giving itself the reward without the job?" - just as humans give themselves "happiness" with drugs. "How do we make sure the AI did not just pretend to do what we wanted while we were watching?" - just as kids do.
@sonkeschmidt2027 · 3 years ago
@@nikolatasev4948 Which is why eventually AI research will have to dive into religion/spirituality. Those were the only successful attempts humans made to solve the general problems that we have. Not saying that all of them were successful; life always moves on, there is always growth and decay/change. But every now and then they generated "the solution" to everything, rippling down to millions and billions of people trying to imitate that.
@markusmiekk-oja3717 · 3 years ago
@@sonkeschmidt2027 I would claim religion does not help with that type of problem.
@sonkeschmidt2027 · 3 years ago
@@markusmiekk-oja3717 Then I invite you to look at what religion does. Functional religion; I'm not talking about what you know or have heard about it going wrong, I'm talking about the cases where it does work (which are those you never hear of because... well, because they work, they don't cause trouble but bring stability, and that doesn't make news). If you look into that, you understand why religion is a global phenomenon and why it has the power it has. Look at scientists and you will also find that the West hasn't stopped being religious; it just rebranded it and called it science. We live in a world with a huge amount of uncertainty, where mistakes can have huge negative consequences. Humans can't deal with that without a working belief system. You have tons of these, you just wouldn't consider them religious, probably. That will change, should life ever show you the scope of uncertainty there is. Good luck making it through without a (spiritual/religious) belief system that is in alignment with the society you live in. =)
@nikolatasev4948 · 3 years ago
@@sonkeschmidt2027 Well, the video about Generative Adversarial Networks, with an agent trying to find flaws and break the AI we are training, gave me strong Satan vibes. But apart from that, I don't think we need further research into religion/spirituality. Simply put, they work on us, a product of long evolution in a specific environment. We need a more general approach, since AIs are a product of a very different evolution and environment. Some solutions for the AI may resemble some religious notion, just as some scientific theories resemble some religious ideas, but trying to apply religion to AI is bound to fail, just as applying religion fails in science.
@goonerOZZ · 3 years ago
Somehow the terminal and instrumental goals talk made me relate the AI to us. As a financial advisor, I have found that many people also make this mistake: money is an instrumental goal, but having spent so much time working to get money, people start to think that money is their terminal goal, so much so that they spend their entire lives looking for money, forgetting why they wanted to have the money in the first place.
@anandsuralkar2947 · 3 years ago
True
@MenwithHill · 3 years ago
Very much the same feeling on my end. I actually found it cute when the chest-opening AI just started collecting keys.
@lennart-oimel9933 · 3 years ago
The reason why I watch this channel is mostly because you can relate almost every video to human intelligence. And it makes sense: why shouldn't the same rules apply to us that apply to AI? I see this channel as an analysis of the problems of intelligence in general. Not only the ones we make ;)
@GrilledCheeseSandwich1 · 3 years ago
It seems like no one realized that this idea is hinted at by the song in the outro: Jessie J - Price Tag. The most famous line from the song is: It's not about the money, money, money
@jackren295 · 3 years ago
@@lennart-oimel9933 Me too. After watching this channel, I started to agree with the notion of "making AI = playing god" that I've sometimes heard in the past. At first, I didn't put much thought into it. But now I've realized that making powerful AGIs that are safe and practical requires us to know all the weaknesses of the human mind, and to make a system that avoids all these weaknesses while still performing at least as well as we can. It's like making the perfect "human being" in some sense.
@RichardEntzminger · 3 years ago
I feel like this isn't just a problem with artificial intelligence but with intelligence in general. Biological intelligence seems to mismatch terminal goals and instrumental goals all the time, like Pavlovian conditioning training a dog to salivate when recognizing a bell ringing (what should be the instrumental goal), or humans trading away happiness and well-being (what should be the terminal goal) for money (what should be an instrumental goal).
@Racnive · 3 years ago
Organizations founded with the intent of doing X end up instead doing something that *looks like they're doing X*, because that's what people see; that's what people hold them accountable to. It doesn't even take intelligence: evolution by natural selection doesn't require any intelligence to winnow things away from what they "want" (terminal goals, should they exist), toward what will survive/replicate (at least in principle, an instrumental goal).
@salec7592 · 3 years ago
I concur with this. The problem is not AI-specific and should be termed something along the lines of the "general delegation problem", or the problem of command-chain fidelity. A subset of it is Miles' nightmare of the inverted capability hierarchy, where command is passed by a less able actor to a more able actor (e.g. a human to an advanced AI).
@Sindrijo · 3 years ago
@@salec7592 Even with perfect interpretability of each component of an AI (e.g. the layers in a neural network), ulterior goals might still be encrypted into looking 'good'. An AI command structure with short-circuiting breaks in the reward loop might help. E.g. you would have people issuing commands/goals to an interpreter AI, which interprets and delegates those commands to another AI (without knowing whether it is delegating to an AI or not), reducing the chance of goal misalignment by replacing complete-loop feedback with shorter feedback loops; also, randomly substitute each component of the command-delegation chain during training.
@sonkeschmidt2027 · 3 years ago
Is that a problem though? Or isn't that what makes life possible in the first place? After all, if you want to solve the problem that is life, you can just kill yourself. All problems solved. But then you can't experience life. So life needs decay in order to create new problems, so that something new can happen. Needing in the sense that existence can only exist as long as it exists. Without existence you don't have problems, but you don't have existence either.
@nahometesfay1112 · 3 years ago
@@sonkeschmidt2027 I might sound sarcastic, but the following questions are sincere. Do you think it's ok for AI to take over the world? Perhaps even drive humanity to extinction? Humans have done the same to other species, even other humans, and humans are not unique from the rest of life in this respect. As you said, decay makes way for new life. I think humanity should be preserved because I find destruction in general unsettling. To be clear, I'm not saying you are wrong or that you believe what I just said. I'm just wondering how your ideas extend to these topics. Edit: typing on my phone so I missed some other stuff: do you think existence is better than non-existence? To me non-existence is neutral. Do you think humans have a moral imperative to maintain their existence? Do you think humans need to go extinct at some point so that reality can continue to change? You brought up some very interesting ideas and I just wanted to hear more of your thoughts.
@YuureiInu · 3 years ago
"Can you spot the difference?" Pauses the video, looking for the difference... nothing. Unpause. "You can pause the video." Pauses again, manically looking for a pattern. More keys? "There's more keys in the deployment. Have you spotted it?" Yes!!!!
@thefakepie1126 · 3 years ago
@Impatient Imp I've counted 12
@-41337 · 3 years ago
Imagine a future where a very trusted AI agent seems to be doing its job fantastically well for many months or years, and then suddenly goes haywire, since its objective was wrong but it just hadn't encountered a circumstance where that error was made apparent. Then: tragedy!
@TulipQ · 3 years ago
I doubt it will be a grand reveal. People will die due to a physical machine, and these interpreter tools can then be used to argue that the victim did something wrong, that a non-AI system was at fault, or that a human supervisor was neglegent. The deployment environment is one full of agents optimized for avoiding liability.
@CyborusYT · 3 years ago
That's actually not that far from normal computer systems. There are countless stories of an ordinary computer system suddenly reaching a bizarre edge case and starting to act completely insane.
@NoName-zn1sb · 3 years ago
@@TulipQ negligent
@gastonmarian7261 · 3 years ago
Like when we designed computers without thinking / knowing about cosmic-ray bit flips, so decades later a plane falls out of the sky because its computer suddenly didn't know where it was. Humans are a trusted AI agent deployed in a production environment with limited understanding of what's going on.
@demoniack81 · 3 years ago
@@CyborusYT Yeah, it happens literally all the time. It's just that usually the error gets caught somewhere along the way, an exception is thrown, and the process is terminated. Which is where you get the error page, and then pick up the phone and go talk to an actual person in customer service who can either override it or get the IT team to fix the problem.
@Houshalter · 3 years ago
Imagine training a self-driving car in a simulation where plastic bags are always gray and children always wear blue. It then happily runs down a child wearing gray, before slamming on the brakes and throwing the unbuckled passengers through the windshield for a blue bag on the road.
@nullone3181 · 3 years ago
The brat in gray was asking for it
@GetawayFilms · 3 years ago
Imagine training a self-driving car to the point where it can competently navigate complex road systems, yet it can't remain stationary until all passengers are buckled up...
@Houshalter · 3 years ago
@@GetawayFilms Cars sold today only flash a warning light/noise if you don't buckle up, and only because government regulations mandate it. Even then, most people disable it.
@GetawayFilms · 3 years ago
@@Houshalter So what you're saying is... it's a 'people' thing. Ok
@sonkeschmidt2027 · 3 years ago
Humans do that all the time. Except that we have a deep genetic imperative to recognise children and protect them, and yet there are loads of examples where these instincts are overridden...
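The gray-bag/blue-child scenario above is a spurious-correlation failure, and it can be caricatured in a few lines: a hypothetical learner picks whichever single input feature best predicts "brake" on the training data. Because clothing color and being-a-child are perfectly correlated in training, it can latch onto color, and a color swap at deployment flips its decisions. The feature names and the stump learner here are invented for illustration, not taken from the video.

```python
# Each sample: (wears_blue, is_child); label: should the car brake?
# In training, children ALWAYS wear blue and bags are always gray,
# so "wears_blue" and "is_child" are perfectly correlated.
train = [((1, 1), 1), ((1, 1), 1), ((0, 0), 0), ((0, 0), 0)]

def fit_stump(data):
    """Pick the single feature index with the best training accuracy.
    Ties go to the first-listed feature - here, clothing color."""
    def acc(i):
        return sum(x[i] == y for x, y in data) / len(data)
    return max(range(2), key=acc)  # max keeps the first of equal maxima

feature = fit_stump(train)
print(feature)  # 0: the learner latched onto clothing color

# Deployment: a child in gray, and a blue bag on the road.
child_in_gray, blue_bag = (0, 1), (1, 0)
print(child_in_gray[feature])  # 0: no braking for the child
print(blue_bag[feature])       # 1: emergency stop for a plastic bag
```

Both features achieve 100% training accuracy, so nothing in training distinguishes the safe rule from the fatal one; only the deployment shift reveals which was learned.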
@ARVash · 3 years ago
An interpreter, a mind-reading device, becomes, once you read it and respond, a way for an agent to "communicate" with you - and it can communicate things that give an impression that hides its actual goal. A lot of these challenges arise when training or coordinating humans, and it's somewhat unsurprising that while a mind-reading device might seem to help at first, it won't be long before someone figures out how to appear to be doing the right thing while watching TV.
@saxy1player · 3 years ago
Great idea!
@Voshchronos · 2 years ago
Very well put
@Practicality01 · 3 years ago
This is starting to get an "unsolvable problem" vibe. Like we are somehow thinking about this in the wrong way and current solutions aren't really making good progress.
@michaeljburt · 3 years ago
Very much so. The psychology of teaching/learning as humans isn't really understood. What *actually* happens when you learn something new for the first time? Feedback on that process is vital. How do you give a machine feedback on what it learned, when you don't know what it learned exactly? It can't communicate to us what it "felt" it learned. In other words, human says: "I said the goal was X". Machine says: "I thought the goal was Y".
@AfonsodelCB · 3 years ago
@@michaeljburt Realize: we actually want these things to be much better than humans. But we might be underestimating how maxed out humans are at certain things. Humans have goal misalignments all the time, and many aren't detected for years.
@josephburchanowski4636 · 3 years ago
"This is starting to get an "unsolvable problem" vibe. Like we are somehow thinking about this in the wrong way and current solutions aren't really making good progress." Welcome to AI Safety. The best part is that if we don't solve the "unsolvable problem", we might all die. Along with all life on Earth, along with all life in the galaxy, along with all life in the galaxy cluster. And with cannibalization of all planets and stars as resources for some arbitrary terminal goal. A potential outcome is a dead, dark chunk of the universe built as a tribute to something as arbitrary as paperclips, or solving an unsolvable math problem.
@sonkeschmidt2027 · 3 years ago
Aren't we touching the biggest unsolvable problem in existence? Existence itself? Think about how terrifying it would be if you could solve every problem, if you could solve life. That would mean there is an absolute border that you would be infinitely stuck with... Sounds better to me that there will always be a new problem to be solved...
@AlejandroMarin.design · 3 years ago
Alignment in humans is solvable. I developed a methodology to do it easily and quickly. So I think alignment in machines is solvable. I've actually designed the methodology to serve machine alignment as well. We'll get there, don't despair.
@andrewweirny · 3 years ago
This is one of your clearest and most interesting videos to date. I'm now very excited for the interpretability video!
@JabrHawr · 3 years ago
a viewer's comment from 2 days ago despite the video having been published just a few hours ago. you must be a patron, or an acquaintance
@andrewweirny · 3 years ago
@@JabrHawr the former.
@michaeljburt · 3 years ago
Agreed. Exciting stuff
@thecakeredux · 3 years ago
The thought of creating a capable agent with the wrong goals is terrifying, actually; and yes, an agent being bad at doing something good is absolutely a problem much preferable to an agent being good at doing something bad.
@xxxJesus666xxx · 3 years ago
speaking of A.I. or psychology?
@gadget2622 · 3 years ago
@@xxxJesus666xxx yes
@ThrowFence · 3 years ago
Isn't this exactly what's happening with mega corporations?
@sharpfang · 3 years ago
Reminds me of the elections a couple of years ago in Poland. A very competent and capable, but thoroughly corrupt and evil political party was voted out and replaced with a party just as corrupt and evil but vastly less competent.
@thecakeredux · 3 years ago
@@sharpfang That, unironically, is an improvement in today's political landscape. If I had to choose a form of evil, it'll always be the less capable rather than the less sinister.
@Turtle76rus · 3 years ago
Can't wait for the "We Were Right! Real Misaligned General Superintelligence" video
@michaelspence2508 · 3 years ago
One more sentence and this would be the scariest Two Sentence Horror Story I've ever seen
@unvergebeneid · 3 years ago
Now here's a reason to actually "hit that bell icon" if I've ever seen one. Because the time window to watch that video would be rather small, I imagine 😄
@PetardeWoez · 3 years ago
probably the last video ever made on the topic
@Zeekar · 3 years ago
The question: which takes longer? Uploading a video to YouTube, or the entire world being converted to stamps?
@christiangreff5764 · 3 years ago
@@Zeekar The former. At the point that video would be produced, we would have our hands full fighting the mechanical armies of the great paperclip maximiser (and it would probably have hacked and monopolized the internet to limit our communication channels).
@Huntracony · 3 years ago
Did you intentionally use the "It's not about the money" song for the video about the AI not going for the coins? Either way, that's quite funny. Well done.
@PhoebeLiv · 3 years ago
His song choices are always amusingly on the nose, actually! A few off the top of my head are "The Grid" for his gridworlds video, "Mo Money Mo Problems" for concrete problems in AI safety, and "Every Breath You Take (I'll Be Watching You)" for scalable supervision.
@Huntracony · 3 years ago
@@PhoebeLiv Nice! Hadn't noticed before, but I'll definitely start paying closer attention from now on.
@thewrongjames · 3 years ago
Another on-the-nose choice was Jonathan Coulton's "It's Gonna Be the Future Soon" on the video about what AI experts predict will be the future of AI.
@matthewwhiteside4619 · 3 years ago
He also used "I've Got a Little List" in one of his list videos.
@SpoonOfDoom · 3 years ago
I didn't catch that, that's great!
@EebstertheGreat · 3 years ago
It looks like in the keys-and-chests environment, the AI was trying to get both keys and chests, but it was strongly prioritizing keys. When there were more chests than keys, it was always spending its keys quickly, so it never ended up with a bunch in its inventory. As a result, it never learned that keys at the left edge of the inventory were impossible to pick up, so it just got stuck there trying to touch them, since they were more important than the remaining chests.
@isaacgraphics1416 · 3 years ago
It's the same problem evolution ran into when optimising our taste palate. Fat and sugar were highly rewarded in the ancestral environment, but now that we live in a different (human-created) environment, that same goal pushes us beyond what we actually need and creates problems for us.
@silphonym · 3 years ago
@@isaacgraphics1416 It's really cool and scary to think of how this stuff applies to our natural intelligence as well.
@ohjahohfrick9837 · 3 years ago
@@silphonym Well, both came about from essentially the same process.
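The key-prioritizing story in this thread can be caricatured with a tiny simulation: a hypothetical agent whose learned values rate picking up a key above opening a chest. When keys are scarce (the training-like setup), the proxy is indistinguishable from the real goal; when keys become plentiful, the same policy hoards. The value numbers and step budget below are invented for illustration.

```python
def run(keys, chests, steps, value={"key": 2.0, "chest": 1.0}):
    """Greedy agent: each step, take the single highest-valued action
    available. 'key' = pick up a key, 'chest' = spend a key on a chest."""
    held = opened = 0
    for _ in range(steps):
        options = {}
        if keys > 0:
            options["key"] = value["key"]
        if chests > 0 and held > 0:
            options["chest"] = value["chest"]
        if not options:
            break
        if max(options, key=options.get) == "key":
            keys, held = keys - 1, held + 1
        else:
            chests, held, opened = chests - 1, held - 1, opened + 1
    return opened

# Training-like episode: 1 key, 3 chests. Only one chest is openable
# anyway, so the key-loving proxy looks perfectly aligned.
print(run(keys=1, chests=3, steps=10))  # 1 (also the best possible)

# Deployment-like episode: 9 keys, 3 chests, same 10-step budget.
# An aligned agent could open all 3; the hoarder grabs keys instead.
print(run(keys=9, chests=3, steps=10))  # 1

# Flipping the learned values ("chest" worth more than "key") recovers
# the intended behavior in the very same deployment environment.
print(run(keys=9, chests=3, steps=10, value={"key": 1.0, "chest": 2.0}))  # 3
```

The point of the sketch is Eebster's: in the first environment the two value functions produce identical behavior, so training pressure alone cannot tell them apart.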
@custos3249 · 3 years ago
Well, pardon my comparison, but you've effectively found an adjunct to heuristic behavior based on sensory inputs: like "things that taste sweet are good" ending up with a dead kid after they drink something made with ethylene glycol. If it's always operating on heuristics, you'll never be sure it's learned what you intended, arguably even after complex demonstrations, given the non-zero chance of emergent/confounding goals. But, relative to human psychology at least, that's not a death sentence; weighting rewards differently, applying bittering agents, and adding a time dimension/diminishing reward over time jump to mind for trying to at least get apparent compliance. Besides, if the goal is "get the cheese," it needs to be able to sense and comprehend "cheese," not just "yellow bottom corner good."
@saxy1player · 3 years ago
I'm not sure I understand you completely, but that IS the biggest problem with these 'intelligent' systems. We have no idea (let's not kid ourselves) how they work. But we are happy when they do what we want them to. Let's not think about what happens when we let these kinds of systems act in the world in a broader sense, and live happily until then xD
@jeremysale1385 · 3 years ago
The ability to slow down and switch into more resource-intensive System 2 thinking when a problem is sufficiently novel is how humans (sometimes) get around this heuristic curse. I wonder if there is some analog of this function that could be implemented in machine learning.
@ChaoticNeutralMatt · 1 year ago
@@jeremysale1385 I imagine that will be the case eventually.
@pumkin610 · 1 year ago
Humans can chase things that seem appealing to us based on what we learned, but we can also choose to pursue a random/painful goal just because we want to. Sometimes we just don't know the negative ramifications of an action, and sometimes we believe things that aren't true.
@custos3249 · 1 year ago
@@pumkin610 Neat. Bet that can still be reduced to and restated as "novelty is good." No matter what goal, drive, etc. you can come up with, it can be put in simple approach/avoidance terms, even seemingly paradoxical behavior. It all comes down to reward.
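One crude machine-learning analog of that "slow down on novel inputs" idea is out-of-distribution detection with a safe fallback: track which states were seen in training, and refuse to trust the fast learned policy on states too far from all of them. The threshold, distance metric, and "safe" action below are all invented for illustration.

```python
import math

class CautiousPolicy:
    """Wraps a fast learned policy with a novelty check: if the input is
    too far from everything seen in training, fall back to a safe action
    (the stand-in for slower, more careful processing)."""

    def __init__(self, policy, training_states, threshold, safe_action="stop"):
        self.policy = policy
        self.seen = training_states
        self.threshold = threshold
        self.safe_action = safe_action

    def novelty(self, state):
        # Distance to the nearest training state.
        return min(math.dist(state, s) for s in self.seen)

    def act(self, state):
        if self.novelty(state) > self.threshold:
            return self.safe_action  # novel input: don't trust the heuristic
        return self.policy(state)

# Toy policy "learned" on states near the origin: "go" whenever x > 0.
train_states = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
agent = CautiousPolicy(lambda s: "go" if s[0] > 0 else "wait",
                       train_states, threshold=1.5)

print(agent.act((0.5, 0.5)))   # "go": familiar state, the heuristic answers
print(agent.act((9.0, 9.0)))   # "stop": far outside training, safe fallback
```

This only sidesteps, rather than solves, misgeneralization: it assumes novelty is detectable by distance in the input space, which real high-dimensional observations rarely make easy.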
@SocialDownclimber · 3 years ago
It always blows my mind how directly and easily these concepts relate to humans. It really goes to show that all research can be valuable in very unexpected ways. I expect that these ideas will be picked up by philosophy and anthropology in the next few years and make a big impact on those fields.
@JamesPetts · 3 years ago
I shall very much look forward to the interpretability video - this should be very interesting.
@offchan · 3 years ago
It's the problem of vague requirements. It's similar to when you tell someone to do something but they do the wrong thing. Humans solve this by sharing common sense with other humans and by using communication to specify stricter requirements.
@ГеоргиГеоргиев-с3г · 3 years ago
Yes, "give me a thing which looks like that other thing I mentioned earlier" in a room full of junk (without additional context) - I've had that problem.
@dsdy1205 · 2 years ago
Actually, humans 'solve' this by having a reward function (emotions) that is only vaguely and very inconsistently coupled with reality, while mounting the whole thing on a very resource-intensive platform where half the processing capability is used just to stay alive, and where modifying itself is so resource-intensive that most don't even try. And even then, we manage to inflict suffering on millions if not billions, so I'd say this isn't really solved either.
@cornoc · 1 year ago
@@dsdy1205 Yeah, I'm starting to think this is a fundamental problem that can't be removed, and that the only reason we aren't as worried about the same thing with humans is that the power of any particular human is limited by the practical constraints of their physical body and brain power. When you give the same type of rationality engine to a super-powerful being, all kinds of horrible things are going to happen. Just look at any war to see how badly a large group of humans led by a few maniacs can fuck up decades of history and leave humanity with lasting scars for centuries or more.
@Tutorp3 жыл бұрын
Hey, the key-AI works kind of the same way most people do when playing computer games... "Oooh, shiny things I don't need at all? I need them all! Game objectives? Meh..."
@johnno41273 жыл бұрын
I realized I experience misalignment due to poor training data every couple of weeks. I work as a courier delivering packages in Missouri, USA, and I often meet people at their homes or workplaces. Unfortunately, I don't learn their names as attached to their faces, but rather as attached to locations, so that when I meet them someplace else I can't remember their names easily (if at all).
@mscout1 Жыл бұрын
I had someone from my TableTop club say 'hi' to me in the gym. No idea who it was, because my brain was searching the wrong bucket of context.
@ozql3 жыл бұрын
I'm glad we found this out now, and not, you know, in deployment. Ever grateful for AI safety researchers!
@picksalot13 жыл бұрын
That was very interesting. Humans often make the same kinds of mistakes when given instructions. We assume word definitions mean the same thing to different people, which is often, but not always, the case. Context can change the interpretation of the instructions. Part of the context is that the instructor knows and understands the goal more thoroughly than the one being instructed, even though it may appear the same. Trying to determine the number of instructions necessary to reach the desired goal, while avoiding all other negative outcomes, is an interesting problem when the species are different. Maybe it would work better if humans learned to think like machines instead of trying to get machines to think like humans. That way, the machines would get "proper" instructions. It looks like that is what the "interpretability tool" is designed to do.
@9600bauds3 жыл бұрын
It's easy enough to have the AI tell you what it "wants" - inside an environment. What you need to know is what it wants *in general*, which is a lot harder. This is why the insight tool isn't very insightful: it's showing you what the AI wants in the current environment, but it doesn't bring us a lot closer to understanding *why* it wants those things in that environment. The solution? Idk lol
@AscendantStoic2 жыл бұрын
Is there even a "why" at this point, without the AI having free will or self-awareness? Like, aren't we the ones reinforcing its interactions or downplaying them with the different objectives in the environment, to teach it what to go for and what not to do? If it goes for the key or coin, we emphasize it as a positive interaction it should do more of; if it hits a buzzsaw, we point it out as a negative thing it should do less of, until it learns it needs to get the coin and avoid the buzzsaws.
@ChaoticNeutralMatt Жыл бұрын
@@AscendantStoic It sounds easier than it actually is, basically. You can certainly try, but there is still the uncertainty of what it actually learned.
@charaicommenternotalt11 ай бұрын
@@AscendantStoic It doesn't NEED self-awareness. For example, in an AI that is trained to recognize cats and dogs, there is still a sort of 'why' it thinks a picture is a dog and not a cat, even though it is not conscious or anything. And the problem is that it's very hard to teach an AI what we want it to do. If we tell it to get a coin, it may learn another goal entirely, unbeknownst to us, that still gets the job done. The problem is when it fails and we realize it learned a different goal. I think the solution is having the AI learn multiple tasks.
@McMurchie3 жыл бұрын
When I first got into AI about 12 years ago, I had encountered these goal misalignment problems well before Rob mentioned them (great vid btw). However, in the time since, I've become convinced that as long as we continue to rely on neural networks, we will never move towards trustworthy or general AI.
@euged3 жыл бұрын
Would you be able to share some thoughts on what alternatives would be better? Thank you
@totalermist3 жыл бұрын
It's fascinating how researchers still insist on using black-box end-to-end models when hybrid approaches could be so much safer and more predictable (in cases where you actually want that, e.g. self-driving cars, code generation and the like). Why aren't self-driving systems combined with high-level rule-based applications so they don't "do the wrong thing at the worst possible time" (quoting Tesla here)? Why don't OpenAI's Codex and Microsoft's Co-Pilot include theorem provers and syntax checkers in their product? ¯\_(ツ)_/¯
@McMurchie3 жыл бұрын
@@totalermist Fully agree; I'm working on these approaches now. To be honest, I think we are just ahead of our time. In 10 years everyone will have moved to hybrid solutions or something further afield.
@IrvineTheHunter3 жыл бұрын
@@totalermist To make a meme, "humans don't learn to speak binary": robots do not see and work through the world on a human level. It's like teaching an octopus algebra or a mantis shrimp art; no matter how smart, or how great their eyesight is, they don't perceive things as humans do. Look at how hard it is for AIs to recognize a car or cup or dog; these things are abstract bundles of details that the human brain can lump together but that are very hard for a hard-coded system. For example, define a cup: describe in simple language a set of rules that would apply to every cup in the world. People collectively understand cups, so it shouldn't be hard... Now, we would have to build an AI with similar rationalizations based not on computer logic but on human logic, and it would be great. It's just a matter of building it. Alan Turing thought we could do it and that it would be easy, but decades of experience have proven him wrong, because it's simply impossible to program a machine to think like a human. We can, however, program it to learn and TEACH it like a human. Is it fallible? Of course; so are humans. Game AIs are made from AI blocks that interact, and they are still chock-full of mistakes. That is to say, even when a program intuitively understands things like a person in the real world, it still shits the bed. kzbin.info/www/bejne/q2bapaJ-ZcR-q6M is a really great example of an AI bugging out because something in its world went wrong. Some talk from Tom Scott on why computers are dumb: kzbin.info/www/bejne/m6LZc5SgbbqMsJY
@crowlsyong Жыл бұрын
Thank you for emailing some of those people and asking questions. That's great, getting stuff direct from the source.
@rentristandelacruz3 жыл бұрын
Now we need an interpretability tool for the interpretability tool.
@badwolf42393 жыл бұрын
We heard you liked interpretability, so we made an interpretability tool for your interpretability tool so you can interpret while you interpret. Now go ask your chess playing AI why it just turned my children into paperclips.
@josephburchanowski46363 жыл бұрын
@@badwolf4239 It told me that it was showcasing its abilities so it could convince human opponents to resign. Researching misaligned-AI examples, it tried to decide which way of transforming someone's children would be the most intimidating. It was a choice between paper clips, stamps, and chess pieces. There was also some mention that it was contemplating turning them into human-dog hybrids. I don't know why. Something about a bunch of people having trauma about a Nina something.
@christiangreff57643 жыл бұрын
@@josephburchanowski4636 At least it did not develop a shape-shifting clown body in order to eat them ...
@GamesFromSpace3 жыл бұрын
Just to be safe, start including pictures of human skulls when doing a pass with those interpretability tools.
@mhelvens3 жыл бұрын
Ah, we're noticing negative attribution when they are surrounded by skin, but positive attribution when they are piled up with a throne stacked on top. I wonder what this means. 🤔
@Swingingbells3 жыл бұрын
AI agent: \*stomp\*
@lilDaveist3 жыл бұрын
@@Swingingbells If picture == human skull: Action = None Ai: „If picture == Human Skull; Action = Double stomp“ „Gotcha“
@arvidhansen5892 Жыл бұрын
Well, what if the AI wouldn't even have considered obtaining human skulls before, and just by introducing them to it you screwed up big time
@clayupton70453 жыл бұрын
any chance that it only likes coins that are in _| corners and it treats moving up and right as an instrumental goal?
@julianatlas51723 жыл бұрын
Thanks for the clarification of what a corner looks like haha
@drdca82633 жыл бұрын
@@julianatlas5172 I think they were distinguishing from e.g. |_ corners, not just giving a demonstration of what corners are
@JohnJackson663 жыл бұрын
It seemed to me that it had learned the most likely location for a coin in the training. It seems obvious to me that training should have more variability than deployment or it is bound to fail.
@fieldrequired2833 жыл бұрын
@@JohnJackson66 The problem is that this whole setup is a simulation of how we want real AI to operate. If you're training an AI for an actual purpose, you will likely be deploying it in a system that interfaces somehow with the real, outside world. And the Real, Outside World will almost *certainly* be more complicated than any training simulations you come up with. After all, The Real World _includes_ you and your simulations. These tests are deliberately set up so deployment is slightly different from training so we can see what happens when the AI is exposed to novel stimuli, and the fact that it didn't learn what we thought it did in training is a Problem. In the real world, not all the cheese is yellow, not all the coins are in corners, and there will always be more complications than we plan for.
@ZT1ST3 жыл бұрын
@@JohnJackson66 The problem from an AI Safety point is that, well...you can't know if you have enough variability in your training. These test cases are ideal for testing how to fix that problem before it becomes a situation like @Field Required mentioned - you want a simple solution that scales up from this into the solution where we don't necessarily have to worry about every single possible variable in deployment.
@leow.21623 жыл бұрын
Is there a chance that very high-level AIs will learn to expect the use of interpretability tools and use them to make us think they are better/safer than they are?
@IrvineTheHunter3 жыл бұрын
I can't remember which video it was, but I believe he did mention this with a super-AI "safety button"*: 1) if the AI likes the button, it will act unsafe to trigger it; 2) if it doesn't like the button, it will avoid those behaviors and/or stop the operator from pressing the button; and if it doesn't know about the button and is smart enough, it will figure out its likely existence and placement (see point two). *A forced-termination switch of any kind. In short, yes: while an AI may not be "alive", it wants its goal and will always act to achieve said goal.
@artemis_fowl44hd923 жыл бұрын
@@IrvineTheHunter It's on the computer phile channel and is called 'AI "Stop Button" Problem - Computerphile'
@AssemblyWizard3 жыл бұрын
Not necessarily. There are some tests that you can't spoof no matter how smart you are, and even if you know they're coming.
@ГеоргиГеоргиев-с3г3 жыл бұрын
@@AssemblyWizard example?
@failgun3 жыл бұрын
Yes. While the AI examples in this video are still simple, the intro to this problem discussed a malicious superintelligence. The instrumental goal "behave as expected in the training environment but do what you really want in deployment" can be performed with arbitrarily high proficiency, so if the AI can learn to hide its intentions from software inspection tools, it will, in principle. Without a way to logically exclude perverse incentives, there is no truly reliable way to screen for them since doing so is proving a negative. "Prove this AI doesn't have an alignment problem" is a lot like "Prove there is no god". No amount of evidence of good behaviour is truly sufficient for proof, only increasing levels of confidence.
@ZT1ST3 жыл бұрын
@5:32; That's a particularly funny example - it knows it has a UI where its keys are transferred to, but it thinks that those new locations are where it can get the keys again, and...is basically learning that keys teleport rather than that they get added to its inventory?
@HoD999x3 жыл бұрын
the AI has no concept of "inventory", it just looks at the screen and sees new keys.
@ZT1ST3 жыл бұрын
@@HoD999x Right, but it's not learning that keys outside the maze are inaccessible and therefore probably part of the collection it uses to open the chests; it's learning that keys move to that part of the screen once collected in the maze. And it doesn't consider that, if that part of the screen *were* accessible, the keys it collects there would just re-appear in the same place.
@HeadsFullOfEyeballs3 жыл бұрын
@@ZT1ST I would imagine that the keys in the inventory aren't seen as _very_ interesting by the AI, so under normal circumstances it ignores them in favour of collecting the "real" keys. But when all the "real" keys are gone and the round still hasn't ended (because the AI is ignoring the final chest), the inventory keys are the only even mildly interesting-looking (i.e. key-looking) thing left on screen, so it gravitates towards them.
@Chuusuisetsujojutsu Жыл бұрын
The whole “values keys over unlocking chests to the point of detriment when given extra keys” reminds me of how many problems in today’s society (such as overeating) are caused by the limbic system being used to scarcity when there is now abundance.
@MrCreeper20k3 жыл бұрын
I live for this content!! At Uni doing Comp Sci and math and AI safety feels like an awesome intersection
@TinoYahoo3 жыл бұрын
I was just thinking of this because my cat took a fat shit in a downstairs area of the house we don't go to often: instead of learning the rule "when you take a shit, do it outside", it instead learned the rule "when you take a shit, do it where it can't be seen". Such is life for a misaligned cat.
@gabrote423 жыл бұрын
Finally see you again! I really hope the world doesn't end in '56. Relying on guys like you!
@underrated15242 жыл бұрын
'56? Huh, funky. I'm only used to seeing years up to about 2022. Guess I'm finally in deployment now, let there be paperclips!
@gabrote422 жыл бұрын
@@underrated1524 If you don't hurry, '56's singularity will overtake ya!
@Imperiused3 жыл бұрын
Congrats on getting an editor. I did appreciate the increase in quality. I think everything we learned from your previous videos about AI alignment really comes together in this one. I was surprised how much I was able to recall.
@Lycandros3 жыл бұрын
Love these videos. Thanks for taking the time to make them.
@daldous3 жыл бұрын
Every single video on this channel has communicated complex ideas so succinctly and clearly that I followed along without any trouble whatsoever. Who knew this subject could be so fascinating. Also, the memes are top notch :)
@Yupppi3 жыл бұрын
I made the mistake of clicking "show more" and then wanting to click "like the video". Few aeons of scrolling later... This topic was super interesting back when I watched the computerphile videos from you, and your channel's videos regarding this topic. I was wondering if the "inventory" being on the game area poses a problem as well? Figuring out how to look into the values of the AI is so impressive.
@olivercroft52633 жыл бұрын
I do psychology and social science. Your channel has so much to offer the humanities by exposing us to brilliant minds and breaking down ideas in computer engineering. Bricoleurs from the English province thank you for the accessibility and kindness
@ANTIMONcom3 жыл бұрын
I hit this problem recently in my own work. Super easy to reproduce, and in a very minimal environment. Experiment: 5XOR (10 inputs, 5 outputs, 100% fitness if the model outputs a pattern where each pair of inputs is XORed). Trained with a truth table using -1 and 1 instead of 0 and 1. After training, I wanted to investigate the modularity of the trained network and its architecture (I evolved both in a GA). So I fed in -1 and 1 for only one of the "XOR module" input pairs, and a larger number, for example 5, in all the other inputs. Would the 5s bleed into the XOR module, or would it be able to ignore input irrelevant to the XOR module? Results: if all the other inputs were 5, it would often answer with -5 and 5. It had learned to scale the output to match what it got as input. I wanted/expected it to answer -1 and 1, but I could see with human eyes that it still knew the pattern, just scaled up. Other times, instead of -1 and 1, I would get answers like 3 and 5. It had learned to answer true and false as numbers where one was 2 higher than the other, and the 5s simply increased both numbers. Still, with human eyes I could see there was a pattern here that was not completely broken by the 5s. Both answers just sort of had the same number added to them. The strategy to achieve high training fitness is just a parameter like all the others, except that it is an "emergent property parameter" that you can't simply read out as a float value. But it is equally as unpredictable as the other parameters in the "black box" neural network.
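The scaling behaviour described above falls out naturally from bias-free piecewise-linear networks, which are positively homogeneous: multiply the inputs by k > 0 and every output scales by k too. A minimal sketch, with weights chosen by hand for illustration (this is not the commenter's evolved network, just one plausible shape it could have converged on):

```python
def relu(x):
    return max(0.0, x)

def xor_net(a, b):
    # Hand-built 2-4-1 ReLU network with NO biases that computes XOR
    # on {-1, +1} inputs: output is +1 when a != b, -1 when a == b.
    # y = 0.5 * (|a - b| - |a + b|), expressed via ReLUs.
    h = [relu(a - b), relu(b - a), relu(a + b), relu(-a - b)]
    return 0.5 * (h[0] + h[1] - h[2] - h[3])

# Correct on the intended {-1, +1} domain:
for a in (-1, 1):
    for b in (-1, 1):
        assert xor_net(a, b) == (1 if a != b else -1)

# A bias-free ReLU net is positively homogeneous: scale the inputs
# by 5 and the "true/false" outputs scale to +/-5, exactly the
# bleed-through behaviour described in the comment above.
print(xor_net(5, -5))   # 5.0, not 1.0
print(xor_net(5, 5))    # -5.0, not -1.0
```

Any solution built purely from weighted sums and ReLUs without bias terms will show this "answers scaled up with the inputs" behaviour, so a GA that never sees out-of-range inputs has no fitness pressure to avoid it.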
@x11tech45 Жыл бұрын
A year behind this conversation, but I think this is a function of faulty (assumed) logic on the part of the test designers. Here's a logic problem that most people fail. I will give you three numbers that follow a rule I'm thinking of. Your goal is to interpret the three numbers and suggest a pattern to me. I will respond with a yes/no answer on whether the proposed pattern meets my rule. Once you believe you understand my rule, you will tell me what you think my rule is. Numbers that fulfill my pattern are 5, 10, 15 / 10, 20, 30 / 20, 30, 45. Now you suggest some rules. Most people will start suggesting strings of numbers, get a yes answer, and then propose a completely incorrect rule. And the reason is that the testing they engage in never probes for failure conditions; it only probes for success conditions. Robust objective definition isn't just about defining success objectives, it's about clearly defining failure objectives. The problem with the examples given is that the training data didn't move the cheese around until it reached production, so you're virtually guaranteed (as speculated) to be training the wrong thing. In order to develop robust objectives, you must also define failure conditions.
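This is essentially Wason's 2-4-6 task: probes that your hypothesis already predicts will pass can never falsify it. A toy sketch of the point (the hidden rule and the guessed hypothesis here are invented for illustration, not taken from the comment's exact numbers):

```python
def secret_rule(t):
    # The experimenter's hidden rule: strictly increasing numbers.
    a, b, c = t
    return a < b < c

def hypothesis(t):
    # The subject's over-specific guess: "a, 2a, 3a".
    a, b, c = t
    return b == 2 * a and c == 3 * a

# Confirmation-seeking: probing only triples the hypothesis already
# predicts. Every probe comes back "yes", so the guess feels proven.
confirming = [(5, 10, 15), (10, 20, 30), (7, 14, 21)]
assert all(secret_rule(t) for t in confirming)

# A probe chosen to *fail* the hypothesis exposes the mismatch:
# the hypothesis rejects it, but the hidden rule accepts it.
falsifying = (1, 2, 4)
print(hypothesis(falsifying), secret_rule(falsifying))  # False True
```

The parallel to training: a curriculum containing only success-shaped examples (cheese always in the corner) is exactly a confirmation-only probe set, so an agent can pass every test while having learned the wrong rule.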
@andyl80553 жыл бұрын
There's a fantastic scene in "Arrival" where they talk about a similar issue. They meet aborigines, point at a kangaroo, ask what it's called, and the reply "kangaroo" actually means "I don't understand you". That actually didn't happen but it illustrates how you can misinterpret answers or scenarios.
@geraldtoaster85417 ай бұрын
when i watched this video 2 years ago, i thought it was pleasantly intriguing. how fascinating, i thought, that it is so difficult to align the little computer brains! certainly a problem for future generations to tackle. nowadays, i look at this and realize we have only a few years left to understand these problems. and we are still at the "toy problem" stage of things, meanwhile AI companies are moving at terminal velocity to deploy systems into the real world. to build agents, to disrupt economies, and to kick me out of my own job market. back then i was curious, now i'm furious :)
@tlniec3 жыл бұрын
Fantastic content and delivery! I also appreciate the use of the Monty Python intermission music during the first "stop and think" break.
@GreenDayFanMT3 жыл бұрын
Fascinating. You remove my negative thoughts on AI as a science with swag language. From physics, I am used to another language.
@i8dacookies8903 жыл бұрын
Are you new to this channel? He has tons of previous videos you should really watch!
@CarlYota Жыл бұрын
I love how the songs at the end reflect the topic of the video. This one was particularly satisfying.
@dino_rider77583 жыл бұрын
It seems that instrumental goals, if too large/useful, have a tendency to slip into becoming semi-fundamental. At that point, they cause misalignment as they're being pursued for their own sake. Instrumental and fundamental are not a strict dichotomy but more of a spectrum or ranking and one that requires a degree of openness to re-considering at every new environment based on how new that environment is.
@pumkin610 Жыл бұрын
There are goals that need to be done asap and ones that can be done later, things we must do to achieve the goal, things we get sidetracked on, and things we avoid.
@morphman863 жыл бұрын
Practical example: Say you're trying to develop a self-driving car. You have a test track, where you train the car. On the test track, you'll place various obstacles exactly 150m onto the track and teach the car to veer out of the way if any of them are present. You have successfully trained it to stay away from old ladies in the middle of the road, oncoming traffic and many other common obstacles. You take the car for a spin in a real-world scenario, it goes 150m, then turns left sharply and crashes into a wall.
@Nayus3 жыл бұрын
In the coin AI experiment, to me it looks like it learned to go to the unjumpable wall. Since the levels are procedurally generated, it is probably programmed so that no wall is made higher than the jump height allows the agent to clear, EXCEPT the one that marks the level as "finished" (where the coin happens to be). If you look at the examples, there's a positive response at every vertical wall (the higher the better, actually), and it makes sense that it learned that when it hits this unjumpable wall, the game finishes and it gets its reward.
@kimsteinhaug Жыл бұрын
Does the model used for this kind of training allow for any understanding of objects at all? I mean, obviously there are coins and walls in the level, as well as buzzsaws and such. You could start a simulation by manipulating controllers, and when an event occurs (points up or down, winning or dying) you save progress as yes-or-no behaviour... an AI training blindly, as if a human were playing without video, only sound. In my opinion we need pixels and an observer, so that the AI controlling the player sees the game like we do; then the AI could be taught the different objectives of the game, and voila, getting the coin should be easy peasy. After all, the AI sees it before even starting the game... just like we do.
@stormwolfenterprises32692 жыл бұрын
Great video! I learned a lot. When I heard the part about "Why did the AI not 'want' the coin when it wasn't at the end of the level?", I formed a hypothesis. My thinking can be illustrated like this (at the risk of making a fool of myself by anthropomorphizing the agent too much): say you are hungry for some pizza. You get in your car and start driving to the nearest pizza parlor. However, as you are driving along, you see a fresh pizza sitting at the side of the road. You could stop the car, grab the pizza, and go back home satisfied. Would you do it? Likely not. You have always acquired your pizza while inside a building of some sort. In other words, you are conditioned to associate getting pizza with being in a building. If you are not in a building, you must not be close to getting pizza yet. The pizza from the side of the road therefore seems "untrustworthy" despite being a valid reward. Coin + wall = good, random coin = ??? || Pizza + building = good, random pizza = ???. The agent only "wants" its reward in the place it expects the reward to be. The expectation is that the reward can still be acquired where it habitually gets it. Normally with humans (taking the pizza analogy a little too far here), if the pizza parlor is in ruins when they get there, they might learn to trust roadside pizza a bit more, since human training never really stops, whereas with this agent it does. That's just what came to mind when I heard that. Again, great video and keep it up! I'd love to hear what other people think about this possible reason for agents developing inner misalignment in scenarios like this.
@stormwolfenterprises32692 жыл бұрын
I've looked a bit more through the comments, and I do notice some other people pointing this out as well. I think I'll keep this up though, since I quite like the pizza analogy, because I am indeed hungry for pizza right now.
@tommeakin17323 жыл бұрын
I want to ask a potentially very... dumb-sounding question, but hear me out: when do we start getting morally concerned about what we're doing with AI systems? With life, we put an emphasis on consciousness, sentience, pain and suffering. As far as "pain" and suffering are concerned, we all know that mental pain and suffering are possible. It seems plausible to me that, for suffering, all you need is for an entity to be deprived of something that it attributes ultimate value to (or to be exposed to the threat of that happening). At what point are we creating extremely dumb systems in which actual mental suffering is occurring, because that lil' feller wants nothing more than to get that pixel diamond, and oh boy, those spinning saws are trying to stop him? Motivation and suffering seem to be closely linked, and we're trying to create motivated systems. I am using the terms "pain" and "suffering" quite loosely, but I don't think unreasonably so. The idea of unintentionally making systems that suffer for no good reason has to be one of the true possible horrors of AI development, and that, combined with our lack of understanding of conscious experience, makes me want to seriously think about this issue as early as possible. We have a tendency to say "that thing is too dumb to suffer or feel pain", but I suspect it's actually more likely for a basic system's existence to be entirely consumed by suffering, as it is less capable, or just incapable, of seeing beyond the issue at hand. It's darkly comical to consider, but I can imagine a world where a very basic artificially intelligent Roomba is going through unimaginable hell because it values nothing more than sucking up dirt, and there's some dirt two inches out of its reach that it has no way of getting to.
@ГеоргиГеоргиев-с3г3 жыл бұрын
Well, here are some questions for you to ponder: Does a rock feel pain? Is it conscious? Are you sure? Even the ones with meat inside? What would bring it pain? Is the human in front of you conscious? How about if he were dead? Do corpses feel pain? ... a lot more unanswerable questions. ... Is there a point in considering the consciousness of things you can't communicate with? (Answer: YES! Comatose patients, plants, animals, and sometimes people in general. All of them and more are on that list (for some, but not for others). Quick FYI: it is possible to communicate with plants, you just need to know how to listen (hint: electrochemistry).)
@anandsuralkar29473 жыл бұрын
Yes, watch the movie "Free Guy". And yes, I always wondered... I think the more complex the network, the more sentient it might become, and at trillions of connections its sentience will be at an animal level, and that will be the real deal. Obviously we won't be able to know if an AI is actually sentient... but still, we can't just hurt it.
@craig43203 жыл бұрын
What if the AI mental-illness problem were even more difficult than the AI alignment problem? Most discussions of the alignment problem assume a basically sane AI that is misaligned. There are many more ways to make a mentally ill brain than a sane brain. It seems likely that a mentally ill AI would suffer more than one that was merely frustrated.
@tommeakin17323 жыл бұрын
@@craig4320 I suppose the "mentally ill AI" is included in the "misaligned AI" camp? The phrasing does often imply rational thought that runs contrary to our own goals, but in terms of literal language, one could refer to a mentally ill mind (human or not) as "misaligned". I'd probably define "sanity" as "appropriately aligned with and grounded in the reality one finds oneself in". I entirely agree that there are more ways to create a mentally ill mind than a sane one. There are always more ways for something to go wrong than ways for it to go right. I'd also agree that a mentally ill mind would be more likely to suffer, as it is fundamentally "misaligned" with the reality it finds itself in. If it is misaligned with a reality, but still has contact with that reality, you've got problems. It's probably a good idea for us to be strongly considering how to create a mentally healthy AI, especially as we're in a culture that is doing a very, very good job of creating mentally ill people.
@alexpotts65202 жыл бұрын
This isn't a dumb question at all - machine ethics, while generally separate from AI safety in the sorts of questions it attempts to answer, is still an interesting/important field. My own take is that these concerns largely come from us not having developed the proper language yet to describe AI. We tend to anthropomorphise - we say an AI "thinks", or that it "wants" things, but I'm not sure that's really the case. We only use those words because the AI demonstrates behaviour consistent with thinking and wanting, but that doesn't mean the AI has feelings in the same way as humans, nor should it have the same rights as us. However, what is true of our current, limited AI systems may not be true in general. Superhuman or conscious AIs lead us into murkier waters...
@ittixen3 жыл бұрын
Yeeees! I'm always holding my breath waiting for your next video.
@witeshade3 жыл бұрын
I guess ultimately the problem is that the definitions of "want" tend to spiral out into philosophy at some point and thus it becomes difficult to know where the machine has placed it.
@hugofontes57083 жыл бұрын
We might be slightly safe from philosophical spirals, because we are not really talking about volitional, conscious want, just the parameter within the black box the AI is trying to manipulate by interacting with its environment. It's really "I wanted it to maximize X for me, so I programmed and trained it to manipulate Y in ways that maximize X, because X is related to real-world thing Y that it can actually manipulate; however, it might just be manipulating Y in order to maximize thing Z, unforeseeably and strongly correlated with X, which may or may not involve murdering us"
@nullone31813 жыл бұрын
We don't know what we want, to a lethal extent.
@Laszer2713 жыл бұрын
So the model that didn't learn to want the coin either learned to want to go to the corner, or learned that the coin-corner combination is good (like maybe a 90-degree angle plus some curve next to it). The problem is that the interpretability tool associates high reward with some area in pixel space. What we would want it to do is associate the reward with some object in the game world. It could probably be made more robust by copying the various on-screen objects to different images without copying the background, and checking whether the object by itself gives high excitation, or whether only some combinations of objects do. Anyway, great video as always, Robert. I hope you can upload more often, because every one of your videos is a treat.
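The probe described above (isolate or remove an object and watch how the model's excitation changes) is essentially occlusion-based attribution. A toy sketch, with a hand-written "value function" standing in for the trained network; everything here is hypothetical and deliberately built to reproduce the corner-seeking failure mode from the video:

```python
def value_net(grid):
    # Toy stand-in for a learned value function: it (secretly) scores
    # coins by closeness to the TOP-RIGHT corner rather than valuing
    # coins as such. grid[r][c] == 'C' marks a coin, '.' is empty.
    h, w = len(grid), len(grid[0])
    score = 0.0
    for r in range(h):
        for c in range(w):
            if grid[r][c] == 'C':
                # corner-seeking: distance to (0, w-1), normalized
                score += 1.0 - (r + (w - 1 - c)) / (h + w)
    return score

def occlusion_map(grid, score_fn):
    # Attribution by ablation: blank out each cell in turn and record
    # how much the score drops. Big drops = cells the model cares about.
    base = score_fn(grid)
    drops = []
    for r in range(len(grid)):
        row = []
        for c in range(len(grid[0])):
            patched = [list(x) for x in grid]
            patched[r][c] = '.'
            row.append(round(base - score_fn(patched), 3))
        drops.append(row)
    return drops

grid = ["...C",
        "....",
        "C..."]
print(occlusion_map(grid, value_net))
# [[0.0, 0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0], [0.286, 0.0, 0.0, 0.0]]
```

On a real agent you would occlude patches of the pixel observation and re-run the value head; the same logic applies, and a corner-chasing policy shows up as attribution concentrated on the corner coin while an identical coin elsewhere gets much less credit.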
@SamuelElPesado3 жыл бұрын
I'll be honest: at this point I'm just here for the ukulele covers. The AI lecture is just a nice bonus. ^_^
@pudgy_buns3 жыл бұрын
This is great! Thank you. I also replayed, a few times, the end bit where the editor makes some good choices; that zoom-in with a cut to sliding sideways was magic. Thanks there, editor. The core video was obviously amazing. Thank you.
@LucaRuzzola3 жыл бұрын
Hi Robert, first of all, thanks for this very interesting video! I wanted to ask a question, though: the premise of your argument is that there is such a thing as the "right" goal, like reaching the coin. But if the desired feature of the goal is always paired somehow with another feature (location, color, shape, etc.), how can we say that one is correct and the other is wrong? If we always place the coin in the same spot, why should the yellow coin take precedence over the location of that spot? It is not clear to me why one of these things should be more desirable than the other; the same holds for looking for a specific color rather than a shape. Why should there be a hierarchy of meaning such that shape > color? I love interpretability research, and I feel AI safety will be one of the crucial aspects of science and technology for the next 100 years, but I also think it is hard to separate human biases from machine errors. I would love to get your opinion on this. All the best, Luca
@LucaRuzzola 3 years ago
P.S. I have not read the paper, and my argument rests on the fact that feature A of the goal is always paired with feature B, which is separate from the goal; if this is not the case in the training environment, then of course what I have said falls apart.
@LucaRuzzola 3 years ago
P.P.S. I guess a truly intelligent system would have to be able to react to the shift, and decide to explore the new environment when, by doing the same "correct" thing it does in training, it does not get the same reward. EDIT: I am not suggesting I have some "right" definition of intelligence, or that systems such as the ones shown in the video do not exhibit intelligent behaviour. I am only adding, as an afterthought, how I think a human would overcome such a situation, and therefore a way an agent could act to get the same desirable capability of adapting to distributional shifts. I should have worded my comment better.
@LeoStaley 3 years ago
@@LucaRuzzola So you wouldn't define an AI which can make plans to achieve its goals, and take action toward them without instructions, as "truly intelligent" if it doesn't adjust for changes in the deployed environment? Cool. Well, we don't care one whit about your definition of "truly intelligent." We care about the fact that this AI is capable of doing, and WANTS to do, things which we don't want it to do. Call it "smiztelligent" for all we care; we aren't talking about something you want to call "truly intelligent". The mismatch between the AI's goals and what we want its goals to be, arising from the mismatch between the training environment and reality (which we did everything we could to avoid), is the problem. We can't possibly come up with all the possible bad pairings the AI might make associations with. We can try, and we can get a lot of them, especially the obvious ones, but this video was just showing us obvious ones so that we can easily see the concept. They won't always be easy to see. Sometimes they may be genuinely impossible for a human to think of before deployment.
@stephentimothybennett 3 years ago
Q: "Why does it learn colors instead of shapes when both goals are perfectly correlated?" A: I would guess that it learns colors before shapes because colors are available as raw input, while shapes require multiple layers for the neural network to "understand". If there were many things of that color in the environment, it would learn to rely on the shape instead.
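The intuition in this comment can be illustrated with a toy numpy sketch (not from the paper; all numbers made up). Per-channel pixel means are a "zero-layer" statistic available directly from raw input: with two equal-area shapes drawn at random positions, that statistic separates colors perfectly while carrying no shape information at all.

```python
import numpy as np

rng = np.random.default_rng(0)

def frame(shape, color, n=16):
    img = np.zeros((n, n, 3))
    y, x = rng.integers(0, n - 5, size=2)  # random position
    if shape == "square":                  # 3x3 block: 9 pixels
        img[y:y + 3, x:x + 3] = color
    else:                                  # 5x5 cross: 5 + 5 - 1 = 9 pixels
        img[y + 2, x:x + 5] = color
        img[y:y + 5, x + 2] = color
    return img

def mean_color(img):
    """Per-channel mean: a statistic a network can read off with no layers."""
    return img.reshape(-1, 3).mean(axis=0)

yellow, grey = np.array([1.0, 1.0, 0.0]), np.array([0.5, 0.5, 0.5])
a = mean_color(frame("square", yellow))
b = mean_color(frame("cross", yellow))
c = mean_color(frame("square", grey))
print(np.allclose(a, b))  # True: same color, different shape -> identical statistic
print(np.allclose(a, c))  # False: a different color is visible immediately
```

Telling the square from the cross requires spatially structured features, i.e. deeper layers, which is one plausible reason gradient descent reaches for color first.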
@LucaRuzzola 3 years ago
@@LeoStaley Hi Leo, I'm sorry if I came off the wrong way. My intention was not to discredit this very good work, but simply to expand our collective reasoning about such issues by stopping for a second to ponder the premises, and why some features of a goal should take precedence over others in an intrinsic way rather than an anthropocentric one. I agree with you that the video gives a great explanation of the subject at hand, and is as interesting as the work put forward by the paper. I am not sure if you were involved with this paper; if you were, I would love to know more about what you mean by doing everything you can to avoid differences between the two environments, and whether you see this phenomenon also when some of the training environments don't exhibit the closely related goals (i.e. in some training envs the coin is in a different position). I understand your point about not being able to come up beforehand with all possible pairings (and the fact that some of them might be hard to detect and risky in the end), and the paper is rather showing the opposite: that if you come up with strongly correlated features, the learned end goal might not be the desired one. But my point stands: why should there be a hierarchy of meaning such that shape > color? If this is something the paper deals with, I will be glad to read that before going further; I just can't read it right now. Again, I am sorry if I came off as demeaning. It's not that I don't see the value of this work or the importance of the problem of mismatch in general; I have seen it first hand in the past with object detection models. P.S. I do not know any superior definition of intelligence; it is just my thought that a strict separation between training and inference phases will pose a limit on NN models, not that they can't already achieve amazing results in tasks requiring "intelligence".
@TexasTimelapse 3 years ago
Someone mentioned you in the Ars Technica comments. Glad I found your channel. Very interesting and important stuff!
@Houshalter 3 years ago
The bottom of Gwern's article on the neural network tanks story contains a long list of similar examples of AIs learning the incorrect goal.
@EliStettner 1 year ago
Thank you for making these videos. Hearing Eliezer Yudkowsky talk about this issue just makes me want to shut off.
@LeoStaley 3 years ago
Non-Patreon notification crew checking in.
@BologneyT 1 year ago
"It actually wants something else, and it's capable enough to get it." Whoa. That's a quote to remember.
@CyborusYT 3 years ago
my guess is in the training there's more locks, but in deployment there's more keys edit: booyah
@SocialDownclimber 3 years ago
In safety analysis, it can be useful to assume that the thing you are analysing has already gone wrong, and to try to work out where. Nice work :)
@nahometesfay1112 3 years ago
Ohh I got it too!
@ichigo_nyanko 3 years ago
The AI does not see the coin as the goal, but as a marker for the goal. Think about it: it controls the movement, so its goal is likely something it can move towards. The AI does not have the context we have; it just sees pixels on the screen. The positive response to the coin is there because it sees it as the marker for the end of the level. However, when the coin is not at the end, it uses other factors to 'realise' the coin is not marking its goal, so it 'ignores' it.
@-na-nomad6247 3 years ago
The editor blowing his own horn at the end is the perfect example of misalignment. OK, I realize that's not as funny as it seemed in my head.
@LowestofheDead 3 years ago
Researchers trained the AI to only find coins at the ends of levels, then tested the AI on something completely different. It's the equivalent of training a dog to chase white swans, then placing the dog in front of a black swan and a white duck. It was never specified that the goal was a coin _at any location_ (if we view the selected training examples as a specification). Therefore this is an _Outer_ alignment problem, so interpretability tools wouldn't help. The solution is finding a way for the AI to guess outer misalignments and ask us for clarification (for example, generating a coin at a different location so the researcher can point out which region has the reward). You could do this pretty easily by just finding the most empty regions of the feature space.
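The "most empty region" query this comment suggests can be sketched as farthest-point selection over a feature space. The 2-D feature (coin_x, coin_y) and all numbers here are made up for illustration: training always put the coin at the far right, so the most informative configuration to show a human is one with the coin far from the right wall.

```python
import numpy as np

def most_novel(candidates, seen):
    """Return the candidate farthest (in feature space) from everything seen
    in training -- a crude proxy for the 'emptiest region' to query about."""
    dists = np.linalg.norm(candidates[:, None, :] - seen[None, :, :], axis=-1)
    return candidates[dists.min(axis=1).argmax()]

# Training features: coin always near the right wall (x ~ 0.95).
seen = np.array([[0.95, y] for y in np.linspace(0.1, 0.9, 9)])
# Candidate level configurations on a grid over the whole feature space.
candidates = np.array([[x, y] for x in np.linspace(0.0, 1.0, 11)
                              for y in np.linspace(0.0, 1.0, 11)])
query = most_novel(candidates, seen)
print(query)  # a coin position near the left wall -- exactly the untested case
```

Showing the human that query ("is the coin still the goal here?") would have caught the coin/wall ambiguity before deployment.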
@Thundermikeee 2 years ago
This channel is basically what got me interested in AI safety. I am still only a college student, and I don't know if I will end up in the field, but at the very least you gave me a good topic for two essays I have to write for my English class. The first just explained why AI safety research is important (albeit focused on a narrow set of problems, given a limit on how much we could write), and now I am getting started on a problem-solution essay. Honestly, without your explanations and pointers to papers, I might never have found the resources I need. Now I just have to figure out which problem I can adequately explain, and show failed and one promising solution for, in less than 6 pages haha. I do feel like I can't do the topic justice, but at the same time I enjoy having a semi-unwilling audience to inform about AI safety being a thing. Anyway, rant over; keep doing what you are doing and know you are appreciated.
@siristhedragon 3 years ago
Researcher: "Ok, so what do you want?" AI: "The coin at the end!" Researcher: "Ok, good!" *Puts the coin at the beginning.* "Ok, now go!" AI: *Still walks to the End, ignoring coin.* Researcher: "What are you doing? I thought we established you want the coin?" AI: "Yes, I want the coin at the end." Researcher: "But you ignored the coin at the beginning!" AI: "...No one said anything about the coin at the beginning."
@madshorn5826 3 years ago
Well, we see the same problem in test-driven education. "Prepare for the test" isn't conducive to critical thinking.
@spaceowl5957 3 years ago
Amazing. Your videos are absolute gold. The explanations and arguments you build are magnificent. It’s so interesting and intellectually satisfying to watch.
@spaceowl5957 3 years ago
This is easily in the top 10 YouTube channels I know, and I watch a LOT of YouTube
@themrus9337 3 years ago
I have to ask about interpreting the AI's goals. I remember seeing a neural network visualization that tried to maximize the activation of different nodes in an object recognition network. Would it be possible to do the same thing here and reverse the nodes, to figure out what the AI sees as good or bad? So if the AI wants a gem, the reverse should be some image of what it thinks a gem is. That brings tons of new complexity and limitations, but I don't see why it would be worse than human interpretation of training vs deployment.
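The technique this comment describes is activation maximization (feature visualization): start from noise and do gradient ascent on the *input* so a chosen unit's output grows, then inspect the optimized input. A minimal sketch, where the "network" is a made-up linear head over a 16-pixel frame that likes bright right-half pixels; a real run would backprop through the trained model.

```python
import numpy as np

w = np.concatenate([np.zeros(8), np.ones(8)])  # the unit's (made-up) weights

def visualize(steps=200, lr=0.05, seed=0):
    """Gradient ascent on the input to maximize the unit's output w @ x."""
    x = np.random.default_rng(seed).uniform(0.0, 0.2, size=16)  # noise start
    for _ in range(steps):
        grad = w                          # d(w @ x)/dx for a linear unit
        x = np.clip(x + lr * grad, 0.0, 1.0)
    return x

opt = visualize()
print(opt.round(2))  # left half stays near 0, right half saturates at 1
```

As the video's cited paper suggests, even a faithful rendering of "what the unit likes" can be ambiguous: an image that maximizes the unit tells you *what* excites it, not *why* the policy values it.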
@nahometesfay1112 3 years ago
Did you finish the video? Rob talks about a paper where they did exactly that. It turns out that even if you know what the AI values highly, you don't know why it values it highly.
@nachis04 3 years ago
This is how a person ends up with "having lots of money" as a terminal goal. If you are badly trained from childhood to "make money to be able to do what makes you happy", it's not surprising that some brains end up scrapping the whole "to do what makes you happy" part.
@OccultDemonCassette 1 year ago
Why's this channel so quiet lately?
@Otek_Nr.3 1 year ago
Nothing is wrong with the channel. Please go back to your task, fellow human. :)
@SamChaneyProductions 3 years ago
This is one of the most interesting videos I've seen in a while about AI. Looking forward to your future videos
@hakonmarcus 2 years ago
Hey! Will you do a video on LaMDA? That interview they published was pretty convincing, and has me all kinds of scared.
@dariusduesentrieb 2 years ago
I just read it, and I feel like I am not quite ready to believe without a doubt that this interview is completely real. If it is, then I agree, it's a bit scary.
@hakonmarcus 2 years ago
@@dariusduesentrieb I did a bit more research, which immediately casts the entire thing into all sorts of doubt. The researcher working on this got sacked; apparently he arranged the interview himself, and we only have his word that this was the original conversation. Also, the chatbot has been trained on conversations between humans and AIs in fiction. A journalist who got to ask it questions got nowhere near such perfect answers.
@jacobgray3112 1 year ago
@9:12 It looks as though the evaluator is using "proximity to the level end" in determining the "coinness" attribute, which makes sense given the training data. So basically, if it looks like a coin but isn't at the end of the level, it isn't a coin. I think this is a great example of what makes the inner alignment problem so difficult to "solve", since it is only when you stumble into the wrong situation that you find out the goal is wrong.
@buttonasas 3 years ago
I wonder if that last AI learned that the wall is part of the "coin" - thinking of it as a composite object to seek after.
@dr-maybe 3 years ago
As always an incredibly interesting video with a clear explanation and convincing argument while being very entertaining. Awesome channel!
@JustAnotherPerson3 3 years ago
I've just had an idea: what if we use Cooperative Inverse Reinforcement Learning, but instead of implementing the learned goal, we tell the system to just specify what it is? Though I don't see any way to provide feedback for it to learn from. Even human evaluation of the output isn't great, since it would probably be the most subjective thing theoretically possible. Maybe output a list of goals with the highest confidence? (Top 10 human terminal goals! Click this link to see! xD) But if solved, that in itself would be of huge value for philosophy and psychology, without negative outcomes (or at least I don't see any :)). Even if that turns out to be a dynamic thing, we can still use the output later, programming it as a utility function for the "doing" AI. This even has some neat side perks: there is no reason not to want the "figuring out" part to be changed into something else, so there is no scenario in which the thing will fight you. And because the "doer" is separate from the thing that gives it goals, you don't need to tinker with its goal directly, thus avoiding goal-preservation problems.
@gabrote42 3 years ago
Interesting. Let's see if somebody notices this
@JustAnotherPerson3 3 years ago
@@gabrote42 Probably not. Too many words :)
@stumby1073 3 years ago
Whatever your terminal goals are, keep going
@donaldhobson8873 3 years ago
The "transparency tool" is showing you where the AI wants to get to. It's not giving you any info on whether the AI wants to get there because there's a coin, or because it's the rightmost wall.
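This ambiguity can be made concrete with a toy gridworld (illustrative sketch only, not from the paper): on a 1-D corridor, "reward for the coin" and "reward for the rightmost cell" produce *identical* value maps as long as the coin sits at the end, so any heatmap derived from the values cannot tell the two goals apart. Only moving the coin separates them.

```python
import numpy as np

def values(reward, gamma=0.9, iters=300):
    """Value iteration on a 1-D corridor; the agent may step left or right."""
    v = np.zeros(len(reward))
    for _ in range(iters):
        left = np.concatenate([[v[0]], v[:-1]])   # value after stepping left
        right = np.concatenate([v[1:], [v[-1]]])  # value after stepping right
        v = reward + gamma * np.maximum(left, right)
    return v

coin_at_end = np.zeros(10); coin_at_end[-1] = 1.0  # "want the coin" (training)
rightmost   = np.zeros(10); rightmost[-1] = 1.0    # "want the right wall"
print(np.allclose(values(coin_at_end), values(rightmost)))  # True: indistinguishable

coin_moved = np.zeros(10); coin_moved[3] = 1.0     # deployment: coin elsewhere
print(np.allclose(values(coin_moved), values(rightmost)))   # False: now they differ
```

Which is exactly why the paper's behavioural test (moving the coin) reveals something the heatmap alone never could.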
@threeMetreJim 3 years ago
Teaching it to get a coin, but it doesn't even know what a coin is. It's as if it can't even 'see' the coin.
@gwenrees7594 1 year ago
I love how the ukulele songs at the end of your videos are subtly related to the themes, e.g. this one has "it's not about the money, money, money...", and the quantilisers (or satisficers?) video has "good enough for me".
@thomasneff376 3 years ago
This is very interesting indeed. In a very literal sense, the training/deployment split reminds me of how soldiers are trained: training is made as close to the anticipated battlefield experience as possible, but it will never match the lessons learned from an actual firefight. Veterans of any field are usually much more effective than new recruits. It would be interesting to see whether the fix for the failed AI deployment you showed is to rate the deployment results on a scale from "complete failure, it died" to "it made it through the battle without a scratch". The agents that survived their last deployment remember their experience and are more effective in future deployments. I think what was shown highlights that learning itself is an ongoing adaptive process, and what doesn't kill it makes it stronger and smarter.
@sikor02 3 years ago
It's funny how I searched for the "It's not about the money" song for a long time, and when I finally found it, a few days later I see this video and the song is at the end. For a moment I thought: "Am I in a simulation and somebody is playing tricks on me?"
@Monkey-fv2km 3 years ago
So AI suffers from the same issues as human behavioural evolution... Good luck solving that one, robot engineers!
@westganton 3 years ago
I don't know much about AI or how I arrived at your video, but in terms of evolution, context is everything. More useful context means a greater ability to adapt to one's surroundings. That's why we have senses after 2 billion years of iteration: because seeing, hearing, feeling, smelling, and tasting are important given our circumstances. Your mouse might only see black, white, and yellow, but I'll bet smelling cheese from around corners would help him find it faster or distinguish it from other yellow objects.
@geld420 1 year ago
That's pretty much why you should randomize training data as much as possible.
@MrWendal 3 years ago
This video was interesting and clear, thanks. Being honest, most of your videos are a bit too hard / dense with terminology for me to get through, but because of the clear examples in this one, I really liked it. Thanks!
@martinogenchi 3 years ago
I would suggest investigating the laziness of the AI. It seems to me that there may be a preference for setting the goal based on the simplest data available (position before color, before shape).
@b42thomas 1 year ago
This video made me realize most of my own problems are inner misalignment between what different parts of my brain/body want and what the whole of me wants.
@MsJaye0001 3 years ago
The problem now: How can we build perfect slave minds that will only think and do things that we want? The problem later: How can we stop these techniques being used to turn human minds into perfect slaves?
@nullone3181 3 years ago
Why does it feel like the amount of possible dystopic/apocalyptic futures keeps growing and growing nowadays? That's, uhhh, not a good sign, I think.