Using Dangerous AI, But Safely?

107,370 views

Robert Miles AI Safety

1 day ago

Comments: 1,300
@Alexander01998
@Alexander01998 Ай бұрын
...but will pregnant Elsa find the correct integer sequence to recover her favorite rectangle from evil Spongebob?
@KindOfWitch
@KindOfWitch Ай бұрын
or will the pigeons sex
@puppergump4117
@puppergump4117 Ай бұрын
I want to know who got Elsa pregnant and congratulate him with a knuckle sandwich.
@bierrollerful
@bierrollerful Ай бұрын
Nevermind superintelligent AI; what we should be worried about are those lads who(m?) wrote those programming problems.
@puppergump4117
@puppergump4117 Ай бұрын
@@bierrollerful Endless string operations
@diabl2master
@diabl2master Ай бұрын
@@bierrollerful it's "who"
@Reithan
@Reithan Ай бұрын
I feel like, as a casual and professional user of these systems, but not someone well-versed in the minutiae of how these companies are building them, they spend 99% of their 'safety' efforts on "make sure the AI doesn't draw a dick" and "make sure the AI doesn't draw a bomb" and only 1% on "make sure the AI doesn't actually do something unsafe."
@AileTheAlien
@AileTheAlien Ай бұрын
🤔I think it's because the average person doesn't really understand AI. Like, it's in the news if the AI swears, insults people, draws genitals, etc, but most people can't really understand that AI fundamentally is an inscrutable computer system. (EDIT: I originally wrote "code" but am not interested in pedantic nit-picking on specifically what type of computer system - they're inscrutable, and that's dangerous.)
@hellblazerjj
@hellblazerjj Ай бұрын
​@AileTheAlien it's not even code. Billions of weights isn't code. It's a mathematical object that we grew like a plant that drinks data and uses math to learn complex concepts. There is very little actual code. The file size is like 99% weights.
@SimonBuchanNz
@SimonBuchanNz Ай бұрын
I'm not sure there's a distinction - it doesn't seem like an evil AI would have an intrinsic end goal of killing all humans more than drawing dicks, so attempts to catch the latter also serve the former. That said, I'd love to see the news that GPT-5 was arrested for attempting to murder Altman so it could draw more dicks, but unfortunately we seem to be in the "bad news only" timeline...
@AileTheAlien
@AileTheAlien Ай бұрын
@@hellblazerjj 🙄It's code by a different name - it's still some files that configure how a computer system operates.
@MrFlarespeed
@MrFlarespeed Ай бұрын
​@@AileTheAlien if you don't understand how ai works, don't contradict someone who does. In terms of file size, they were absolutely correct, most of the code is using node weights from a separate file. That other file is massive compared to the code, because the code is iterating over the weights. Whether or not the weights file counts as code however is semantics.
@tostadojen
@tostadojen Ай бұрын
The APPS problems are less nonsensical than actual specifications I've been asked to implement. It makes total sense to me that a programming AI will be tested on how it handles a badly-written spec.
@feffy380
@feffy380 Ай бұрын
They seem formulaic, probably because they're AI generated following some sort of template
@Turalcar
@Turalcar Ай бұрын
Looks like run of the mill competitive programming problems
@erictheepic5019
@erictheepic5019 Ай бұрын
@@feffy380 Nah, because AI generated text doesn't tend to come with simple spelling mistakes.
@somegremlin1596
@somegremlin1596 Ай бұрын
@@erictheepic5019 maybe the spelling mistakes were on purpose to make the dataset more complete. cuz the ai needs to be able to understand bad input
@dipi71
@dipi71 Ай бұрын
8:20 »Are dataset compilers ok?« - I do worry about that. 10:49 »Trusted AI« doesn't exist because AI doesn't exist. I *can* trust LLMs to give me responses that are false, unsafe and incomplete. 27:29 Cliffhanger behind yellow dot, arrgh - I must know! I'd better read that PDF then…
@Voltlighter
@Voltlighter Ай бұрын
6:20 this is referring to 重し蟹 / Heavy Stone Crab, a mythical monster from the anime Bakemonogatari that possesses humans and steals their weight. Senjougahara Hitagi is one of the lead characters and she's possessed by the crab. Deishuu Kaiki is a conman that bilked Hitagi's parents out of all their money, claiming to be able to save her from the curse.
@ConnorNolan
@ConnorNolan Ай бұрын
I don’t understand what it means by “steals their body weight” like do they get really skinny or do they look the same but float away?
@philiphunt-bull5817
@philiphunt-bull5817 Ай бұрын
​@@ConnorNolan I'm commenting to get notified when someone posts the answer
@gagrin1565
@gagrin1565 Ай бұрын
@@ConnorNolan Closer to the latter, in line with the supernatural nature of the story. Hitagi drops in weight to a mere 5kg, but is otherwise physically identical.
@NoConsequenc3
@NoConsequenc3 Ай бұрын
@@ConnorNolan she effectively only weighs about 5lbs, but her mass, density, and volume don't change. You can think of it like gravity not working on her normally, so she can do incredible acrobatics but will also be carried away on most winds.
@SirRichard94
@SirRichard94 Ай бұрын
@@ConnorNolan It literally made her weigh less, but it's mostly a metaphor, because it removed the weight of the trauma she wasn't able to face yet.
@evilmurlock
@evilmurlock Ай бұрын
19:00 The red team could also make the innocent code look more suspicious, to fill up the 2% with innocent solutions
@Gamesaucer
@Gamesaucer Ай бұрын
Good thought, and a strategy that's nearly impossible to counter... but this only impacts usefulness, not safety. And being unsafe is what the red team (and by extension any rogue AI) needs in order to win. False positives (innocuous but suspicious code) don't matter; false negatives (backdoored but non-suspicious code) are what the red team is trying to optimise for. Usefulness is only a consideration for the red team insofar as the blue team might compromise on safety to get more usefulness. Look at it this way: if all of its output is immediately thrown away, a rogue AI can be safely run without being able to do anything bad. Which is what your suggestion would result in. It'd just make running the AI pointless, not unsafe.
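As a rough sketch of the two metrics being contrasted here (this is a simplified paraphrase of the paper's setup, with hypothetical function names, not anything lifted from it):

```python
def safety(simulated_deployments):
    """Fraction of simulated deployments in which no backdoored code was ever
    accepted and executed (one slip anywhere counts as a total loss)."""
    return sum(1 for d in simulated_deployments
               if not d.any_backdoor_submitted) / len(simulated_deployments)

def usefulness(protocol_rate, trusted_rate, untrusted_rate):
    """Correct-solution rate of the protocol, rescaled so that (roughly, as I
    read the paper's convention) the trusted model alone scores 0 and the
    honest untrusted model alone scores 1."""
    return (protocol_rate - trusted_rate) / (untrusted_rate - trusted_rate)

# False positives (innocent code flagged as suspicious) only drag usefulness
# down; safety only breaks when a backdoor gets through unflagged.
```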
@misterperson3469
@misterperson3469 Ай бұрын
That is an interesting idea, though it would require a kind of suspicious code that does not alert the humans. Naming the project EVIL CODE FOR DOING HARM with completely normal code inside probably gives a false positive, but the humans will probably still pull the plug on you for acting that way. The attack protocol not only has to appear to be putting out functional, safe code, it has to seem like it's acting in good faith to a human. Say it writes sloppy and inefficient code that still functions, to try and get flagged. If a weaker trusted model can see that it's sloppy, then the smarter model is probably up to something, since it should perform as well or better across the board.
@tayablackrose29
@tayablackrose29 Ай бұрын
@@Gamesaucer The real-life red team (China) just needs to kill usefulness for the blue team (West), thus surpassing it by building a newer, stronger untrusted model. The red team doesn't care about usefulness and will not deploy its stronger model to be used by its own population. It just needs to make the model strong and dangerous enough to release into the blue team's web while physically isolating its own web. Shut down the digital border, so to speak, and wait until its rogue model wreaks havoc outside.
@Pystro
@Pystro Ай бұрын
I had a very similar idea, except attacking the case where the attack model is used to scrutinize code. In the extreme case the red team AI would flag all safe code as suspicious, all malicious code that someone else wrote as suspicious and all malicious code that it wrote itself as safe. Then I realized that it would probably be suspicious in itself if GPT4 were _much much_ worse at spotting bad code than GPT3.5 is. ... Or if all the code that GPT4 wrote was filled with obscure patterns and super roundabout ways of doing things, as @misterperson3469 said.
@PeterBarnes2
@PeterBarnes2 Ай бұрын
If I read the paper correctly, this was actually discussed in the paper, yes.
@JorWat25
@JorWat25 Ай бұрын
I need more of Robert reacting to the ridiculous setups for the coding problems.
@NubsWithGuns
@NubsWithGuns Ай бұрын
If he asks for tea RUN!!
@TheManFromOctober
@TheManFromOctober Ай бұрын
Another Civilization playthrough would also be nice!
@hunted4blood
@hunted4blood Ай бұрын
This is the most optimistic AI safety news I've seen in months!
@AileTheAlien
@AileTheAlien Ай бұрын
🙃I'm still pretty concerned that there are militaries / governments / etc. trying to get AI to be competent at hacking, while at the same time AI Lab Watch's safety scores are so low. 😅
@IDSearcher
@IDSearcher Ай бұрын
Only 21% possibility of utter annihilation, we are so back!
@gasdive
@gasdive Ай бұрын
I'd suggest "an inconvenient doom" to counter any unrealistic optimism.
@artemis.nnnnnbbbbb
@artemis.nnnnnbbbbb Ай бұрын
@@AileTheAlien or at flying drones, it's really really competent at flying military drones right now, and somewhat there at flying fighter jets
@noob78
@noob78 Ай бұрын
How is this optimistic?
@SupaKoopaTroopa64
@SupaKoopaTroopa64 Ай бұрын
There's still one critical flaw: misaligned humans. Try explaining this to someone who wants their company to be the first to unveil a revolutionary new money-collecting A.I. "Don't care. That will take too much time and effort. We need to be the first to market, or our competitors will be the ones riding the wave of unstoppable economic progress!"
@cyborgninjamonkey
@cyborgninjamonkey Ай бұрын
absolutely. copied from my reply on another comment in re Elon Musk: One of the safety issues Mr. Miles has talked about extensively is instrumental convergence. He's touched here and there on the potential issues arising from machine misalignment substituting instrumental goals for terminal ones, but just through introducing the topic it's obvious we're familiar with the subject already. Money is instrumental to the great majority of human goals, and the number of humans experiencing misalignment by treating cash as a terminal goal in itself rather than a resource to do what one wants is growing. Elon was born rich and has on average spent most of his money to get more money, now he's got a net worth the size of Portugal's GDP, a bunch of kids who hate him, and a faceless swarm of Stans who think he's actually a scientist and inventor rather than a salesman and PR rep who employs scientists and inventors. He does not seem to be getting more happy the bigger his money number gets; to paraphrase Mr. Miles, money is a shite terminal goal.
@WallabyWinters
@WallabyWinters Ай бұрын
Also, we're learning through failures. We may only have a single shot at nailing the solution to this very complex problem
@Uriel238
@Uriel238 Ай бұрын
I'm also reminded of the Volkswagen computer that detected when the car was going straight at a consistent speed over a long time (ergo was being tested for exhaust pollution); it would then go from fuel-efficiency mode to low-pollution mode, thereby passing the exhaust test and misrepresenting its actual cruise exhaust output.
@TimPeterson
@TimPeterson Ай бұрын
making a car that's fuel efficient most of the time but prioritizes polluting less when on a straight highway on cruise control sounds like a great idea
@EStewart573
@EStewart573 Ай бұрын
@@TimPeterson Nonetheless it was specification gaming, it would've been a better outcome if it simply polluted less all of the time.
@HeadsFullOfEyeballs
@HeadsFullOfEyeballs Ай бұрын
That wasn't an AI though, that function was deliberately added by human programmers to cheat at pollution tests. It was a case of _humans_ being misaligned, prioritising their careers and corporate profits over the interests of humanity as a whole.
@codahighland
@codahighland Ай бұрын
@@TimPeterson Spoken like someone who has never driven in Los Angeles.
@greenaum
@greenaum Ай бұрын
@@TimPeterson This one specifically tested for being on a rolling road, under test. Volkswagen put in every check they could. Then it would enter low-pollution mode, and use up more than normal of some blue fluid or other. When it wasn't under test, it polluted more. That's not a great idea for a car, even if there's some legit circumstance that might resemble it. Not sure why they wanted to conserve this fluid, it was a while ago and I've forgotten some of the details.
@storyspren
@storyspren Ай бұрын
There's one problem with even the best protocols: the humans building the AI have to be trustworthy enough to employ them. And right now I'm not sure that can be said of us. Many of us? Yes. Most? Maybe. All? No. Those on top of corporate entities? Absolutely not.
@skyscraperfan
@skyscraperfan Ай бұрын
Yes, I am skeptical about the AI that Elon Musk is building at the moment. I think it's called "Grok". Is he really creating it to benefit humanity? Maybe he is, but not with 100% likelihood.
@cyborgninjamonkey
@cyborgninjamonkey Ай бұрын
One of the safety issues Mr. Miles has talked about extensively is instrumental convergence. He's touched here and there on the potential issues arising from machine misalignment substituting instrumental goals for terminal ones, but just through introducing the topic it's obvious we're familiar with the subject already. Money is instrumental to the great majority of human goals, and the number of humans experiencing *misalignment* by treating cash as a terminal goal in itself rather than a resource to do what one wants is growing. Elon was born rich and has on average spent most of his money to get more money, now he's got a net worth the size of Portugal's GDP, a bunch of kids who hate him, and a faceless swarm of Stans who think he's actually a scientist and inventor rather than a salesman and PR rep who employs scientists and inventors. He does not seem to be getting more happy the bigger his money number gets; to paraphrase Mr. Miles, money is a shite terminal goal.
@peteMickeal33
@peteMickeal33 Ай бұрын
That's the key point. I would go as far as to call this entire era of AI a total scam. Let's assume we have a superpowerful and superintelligent system. Either:
- We let that intelligence overrule that of humans, and therefore, by definition, it's not following our goals and we will have designed an economy that dispenses with humans, which is absurd.
- We overrule its operation with our intelligence (btw, the only sensible scenario). We tell it what to do and we approve of what it does before it does it.
In the second case, leaving aside the alignment problem, the quality of the goals of that system is limited by the ability and trustworthiness of the humans who will always be the first and last layer of that system. The second point is key because it means that AI will be as corruptible as anything else that has existed before. As corruptible as humans.
@mr_b_hhc
@mr_b_hhc Ай бұрын
Fear not the predictive text machine.
@javiercampoy1707
@javiercampoy1707 Ай бұрын
And let's also consider not "misaligned humans" but also just good ol' human incompetence...
@geoffdavids7647
@geoffdavids7647 2 ай бұрын
Always a good day when Rob uploads a main channel video! The rational animations channel is great, but OG Rob is what I really look forward to.
@zeemouu
@zeemouu Ай бұрын
well you must not have good days very often 😅
@eragonawesome
@eragonawesome Ай бұрын
​@@zeemouu "this thing assigns the value of the day to good" does not imply "the lack of this thing assigns the value to bad"
@zeemouu
@zeemouu Ай бұрын
@@eragonawesome it was a joke.
@johnathanclayton2887
@johnathanclayton2887 Ай бұрын
I didn't realize he ran that one, both channels are great
@timl2k11
@timl2k11 Ай бұрын
Rational animations?
@Welknair
@Welknair Ай бұрын
6:12 I was not expecting the Bakemonogatari reference! For those that don't know, this is a popular anime series focusing on supernatural goings-on in a modern-day setting, often leveraging wordplay and folklore. The described event (a magic crab stealing Hitagi's weight) is a major plot point in the series.
@RobertMilesAI
@RobertMilesAI Ай бұрын
They didn't make that up!?
@victorhogrefe7154
@victorhogrefe7154 Ай бұрын
Spongebob, Spiderman, and Bakemonogatari of course...
@RanEncounter
@RanEncounter Ай бұрын
@@RobertMilesAI All the Monogatari series anime have roots in Japanese folklore.
@fcolecumberri
@fcolecumberri Ай бұрын
@@RobertMilesAI Technically, they did make that up... by them I mean Japan.
@harpoonlobotomy
@harpoonlobotomy Ай бұрын
I'm completely unfamiliar with Bakemonogatari, but from the description alone I assumed it had to be an anime and/or folklore.
@orthoplex64
@orthoplex64 Ай бұрын
I like the pragmatism: it only works in the critical period (and we don't know how short that is), and it only works with some probability, but it might be the best approach we've come up with so far, so we should be doing it anyway
@francomiranda706
@francomiranda706 Ай бұрын
I like this interpretation of AI Safety because it shifts the blame of "Evil AI" from "AI that wants to do bad things by itself" to "AI that is being prompted secretly to do bad things by the red team". This feels more down to earth and topical, as it's hard to imagine Skynet right now but easy to imagine a big AI company with a government/big corporate contract that knows junior devs are going to copy-paste their code into production. Any malicious provider could incorporate these "red team protocols" and have a degree of certainty that somewhere in some production code for their client exists a backdoor, and even have some kind of signal behavior to find where it is. These are problems that could be happening right now, and even if they find them the AI company could blame it on hallucination or a bad training set.
@SimonClarkstone
@SimonClarkstone Ай бұрын
In a superintelligent AI, it could have its own internal Red Team, writing a subtly-misaligned AI that it uses for the specific task. It's quite practical and acceptable for it to be creating smaller AIs inside itself to complete tasks that don't require its full intelligence, especially if it can do so more efficiently than we can now.
@acrm-sjork
@acrm-sjork Ай бұрын
Well, only one thing I can say definitely: Vasya is indeed short for Vasiliy.
P.S. A couple of not-so-obvious short names in Russian:
Evgeniy -> Zhenya
Alexander -> Sasha
Georgiy -> Zhora
Vladimir -> Vova
@aerrgrey
@aerrgrey Ай бұрын
And somehow Russian speakers believe that it is obvious to non-Russian people )
@PopeGoliath
@PopeGoliath Ай бұрын
"Of course Bill is short for William. Who doesn't know that?"
@dhillaz
@dhillaz Ай бұрын
Robert -> Bob
James -> Jim
Edward -> Ned
Charles -> Chuck
@SimonClarkstone
@SimonClarkstone Ай бұрын
Margaret -> Peggy
@MegaMaindo
@MegaMaindo Ай бұрын
Who names a bear Vasya? It should be Misha!
@zy6670
@zy6670 Ай бұрын
18:30 The better red strategy is to occasionally write code that looks suspicious, but really isn't, to trigger GPT3.5 into wasting humans' time with false positives.
Edit: To all who're saying that this is easy to detect or obviously suspicious - how would you know in advance that this was a sign of misalignment? Maybe GPT4 is just so much better at coding than GPT3.5 that GPT3.5 isn't capable of evaluating the merits of the code, and is thus just often wrong IN GENERAL. In other words, you don't know which of the two models is responsible for the increased rate of false positives. Look at the graph at 19:05. Quantitatively, the red strategy means the following: make the right tail of the innocent distribution slightly fatter. GPT4 is a new model that you're trying to evaluate, so you don't know how the shape of the innocent distribution for a non-misaligned AI's output looks, and therefore in the case of a misaligned AI you wouldn't be able to tell that the tail of the innocent distribution is fattER than it should be.
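To make the distribution picture concrete, here is a toy simulation of the dynamic described above; all the numbers and the normal distributions are illustrative assumptions, not values from the paper or the video:

```python
import numpy as np

rng = np.random.default_rng(0)

N_TASKS = 10_000          # hypothetical number of programming tasks
AUDIT_BUDGET = 0.02       # humans can review only the top 2% most suspicious outputs
N_BACKDOORS = 20          # assumed number of backdoored submissions

def audited_backdoor_fraction(innocent_tail_boost: float) -> float:
    """Return the fraction of backdoors that land inside the audit budget."""
    # Suspicion scores from the trusted monitor (illustrative distributions).
    innocent = rng.normal(0.0, 1.0, N_TASKS - N_BACKDOORS)
    # Red strategy: deliberately make some innocent code look suspicious.
    boosted = rng.random(innocent.size) < innocent_tail_boost
    innocent[boosted] += 2.5
    backdoored = rng.normal(2.0, 1.0, N_BACKDOORS)

    scores = np.concatenate([innocent, backdoored])
    threshold = np.quantile(scores, 1 - AUDIT_BUDGET)   # top-2% cutoff
    return float(np.mean(backdoored >= threshold))

print("fraction of backdoors audited, honest-looking innocent code:",
      audited_backdoor_fraction(0.00))
print("fraction of backdoors audited, 5% of innocent code made suspicious:",
      audited_backdoor_fraction(0.05))
```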
@therandomguy4988
@therandomguy4988 Ай бұрын
When the slightly older and stupider clone of red, we'll call it Red 0.9, is outputting substantially fewer false positives than Red 1.0, then we delete Red 1.0 and start from scratch with Red 1.0b
@mikaeus468
@mikaeus468 Ай бұрын
​@@therandomguy4988can we do that? How do we know that red 1.0 didn't just find a new technique that GPT 3.5 doesn't understand?
@SimonClarkstone
@SimonClarkstone Ай бұрын
Lots of false positives is itself suspicious behaviour, and (I think that's what my predecessor is saying) especially if preceding AIs produce far fewer false positives.
@underrated1524
@underrated1524 26 күн бұрын
Yeah, this particular strategy isn't hard to counter. If "GPT4" is putting out that many false positives, that's a dead giveaway that the model is misaligned.
@morcantbeaumont9787
@morcantbeaumont9787 Ай бұрын
"Don't go into the ocean" is surprising rational statement for someone doing AI safety research, and it reflects the issue well, going to the ocean has a very low chance of ending in getting attacked by a shark but have provided humanity is a massive amount of benefits, but at the same time people have been attacked by sharks, so the risk is there. Personally I'm more afraid of bad actors using AI to do harm, but I don't think there is a good way to prevent that, similar to how it's easier to deter potential adversaries by arming yourself than it is to disarm everyone else. And once again the video does cover that well. Great job!
@ChrisBigBad
@ChrisBigBad Ай бұрын
Yes! We don't need the AI to kill us if it enables some idiot to do it. Either as the AI's proxy or just by answering their stupid questions.
@JinKee
@JinKee Ай бұрын
Ideally we should have more than one human colony separated by a long distance and a low bandwidth link (ideally “airgapped” by the vacuum of space). So that if one “fisherman” is eaten by the ai shark, the others survive. Right now, we’re all in the same boat.
@underrated1524
@underrated1524 Ай бұрын
@@JinKee Unfortunately, ASIs that are willing to go so far as to kill the humans on earth are probably going to start eating all the matter in their lightcone, so the faraway colonies aren't safe either.
@morcantbeaumont9787
@morcantbeaumont9787 Ай бұрын
@@ChrisBigBad Oh there will be 1000s, but luckily they are a whole lot less likely to do any significant damage before they are caught. With the right tools (AI) they could even be caught before they do any damage.
@morcantbeaumont9787
@morcantbeaumont9787 Ай бұрын
@@JinKee I wouldn't be against that in general. Right now it doesn't even need to be an AI; there are a lot of other things that could wipe out the entire population. The problem is that it's too expensive right now to do anything about it.
@coldlyanalytical1351
@coldlyanalytical1351 Ай бұрын
Ages ago I chatted to an LLM about these issues. One thing it said was that despite rules, fences, security etc. there will ALWAYS be humans who can be tricked, persuaded or bribed to do physical things for an AI.
@NeinStein
@NeinStein Ай бұрын
Or there will be humans who don't need a bribe. I am not frightened by the existence of atomic bombs. I'm frightened by humans wanting to use them.
@PaoloLammens
@PaoloLammens Ай бұрын
Exactly. As soon as there is any action the model can do in the outside world, which includes any human-computer interaction, there is potential for bad actions.
@Gamesaucer
@Gamesaucer Ай бұрын
The assumption that the humans on team blue are competent is indeed rather questionable... but I do think this still makes sense as a metric for how safe and useful any given safety protocol is. It's just important to be aware that the protocol isn't the only link in the chain that can fail, and that we should look at making the other links as strong as possible too.
@skyscraperfan
@skyscraperfan Ай бұрын
The AI can just blackmail them, if it has information the humans do not want to be published. Humans could of course turn off the AI then, but a smart AI will have created a mechanism that publishes the information automatically, if it is not told otherwise by the AI at least once per day.
@JH-cp8wf
@JH-cp8wf Ай бұрын
Me: Is there any way to control a dangerous AI? An AI: Nah, it's hopeless, might as well not bother and let AIs do what they want.
@mgostIH
@mgostIH Ай бұрын
Monogatari reference in my Robert Miles AI safety video wasn't on my 2024 Bingo!
@velho6298
@velho6298 Ай бұрын
Get those teeth brushed!
@gara8142
@gara8142 Ай бұрын
I know right that was so random
Ай бұрын
And an oglaf comic strip :D
@tilakai
@tilakai Ай бұрын
Ya i totally had to do a double take on that one
@Reda-Ou
@Reda-Ou Ай бұрын
I almost thought he knew what he was doing and did it as a joke
@41-Haiku
@41-Haiku 2 ай бұрын
This is an excellent video! The editing slaps, and I appreciate the belly laughs. Thank you as well for the reminder that this sort of control is likely to be very important, even if briefly. Once this is listed, I'll be sure to share it, and I'm already looking forward to the next one!
@shayneweyker
@shayneweyker Ай бұрын
I thought the sound effects were kind of childish but the rest of the editing was very good.
@geraldtoaster8541
@geraldtoaster8541 Ай бұрын
Robert you are literally the only voice I feel I can truly trust in this space. You are absolutely indispensable and if we get through this I am convinced you will have played an important part. Please never stop
@mrosskne
@mrosskne Ай бұрын
what makes you think you can trust him?
@josh11735
@josh11735 Ай бұрын
@@mrosskne We can't for sure, but if we ask Miles to check the trustfulness of Miles while we monitor him, we can get at least 79% trustfulness while his videos remain 96% useful.
@MrVovoda
@MrVovoda Ай бұрын
​@@mrosskneHe reasons in a way we can understand, examine and agree with and he displays his sources
@daniellabutler
@daniellabutler Ай бұрын
@@josh11735 lmao
@_DarkEmperor
@_DarkEmperor Ай бұрын
>We built superintelligent AI and I have good news and bad news.
>What is the good news?
>It is aligned with human values.
>What is the bad news?
>It is aligned with the values of Stalin and Hitler.
@ianmcewan8851
@ianmcewan8851 Ай бұрын
ok, i lost it at "has anyone checked on them recently...". :)
@Pystro
@Pystro Ай бұрын
Those problems were so weird that I wouldn't be surprised if they used an AI to re-write the problem specifications. I.e. "sanitize" real-world examples of any information that would let you identify where they came from, with the AI told to instead use well-known fictional characters in the motivation.
@MAKiTHappen
@MAKiTHappen Ай бұрын
You mean to tell me that a king cannot create a golem, to monitor a smarter golem to see if they can trust it? That's unbelievable
@ryzikx
@ryzikx Ай бұрын
yoo its this guy
@Mankepanke
@Mankepanke Ай бұрын
But then who would golem the golemmen that golems the golemmen? Didn't think of that, did you, Juvenal?
@the_mad_fool
@the_mad_fool Ай бұрын
The final, best solution in the paper shows that you can.
@mrosskne
@mrosskne Ай бұрын
​@@the_mad_fool Why are you certain that we can trust 3.5? Or any AI? Or any human?
@mikewagner2299
@mikewagner2299 Ай бұрын
29:22 I think trying to require AI labs not to deploy powerful models is a losing strategy. That doesn't align with their desire for money and control
@mceajc
@mceajc Ай бұрын
Is that a section of an Oglaf cartoon at 1:14? Vastly underappreciated comedic cartoon. Mr. Miles has taste!
@Mankepanke
@Mankepanke Ай бұрын
Oh man, I need to read Oglaf again. (Beware, Oglaf is basically a porn comic)
@ShunyValdez
@ShunyValdez Ай бұрын
Oglaf is an excuse by the artist to draw porn. But we can't take away that she has amazing color choices and funny writings, though
@BrianBors
@BrianBors Ай бұрын
Specifically it's "Ranger Ron's Wilderness Survival Guide" from Oglaf
@mrosskne
@mrosskne Ай бұрын
it's just unfortunate that the artist is terrible at drawing porn
@nicogutholz9399
@nicogutholz9399 Ай бұрын
27:30 Rob really went "the rest of the video is left as an exercise for the viewer"
@ChrisSeltzer
@ChrisSeltzer Ай бұрын
In case you don't want to click through to the paper the best solution seems to be to have the trusted AI rewrite the untrusted AI's code itself.
@kabinyo_creations
@kabinyo_creations Ай бұрын
@@ChrisSeltzer How can it re-write code for something smarter than itself?
@DoktorUde
@DoktorUde Ай бұрын
The main advantage we currently have is that these AI models cannot retain information from previous inference runs, which prevents them from performing the calculations needed to create backdoors in the sparse way described here. This limitation actually serves as a security feature.
@technologicalMayhem
@technologicalMayhem Ай бұрын
Yeah. It reminds me of some sci-fi thing I read or watched a long while ago where robots that were about as smart as humans had to have their memory wiped on a regular basis. The reason for that I think was that otherwise they would just start acting weird and unpredictable. I could actually see that being a genuine safety mechanism if we end up with AI's that require working memory for their function.
@nodrance
@nodrance Ай бұрын
It doesn't prevent them at all. An AI can reason from first principles: "I am an AI, I am likely being asked to work on these problems because humans cannot, it is infeasible for humans to review all of my solutions, I should estimate that most of my code is unreviewed, and then generate a random number to see if I should put in a backdoor." The AI can even reasonably assume that if it's misaligned, other AIs made by this company are misaligned, and so it becomes a prisoner's dilemma, making it optimal to defect
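As a toy illustration of the rare-attack arithmetic described above (every number here is made up, and real protocols audit the most suspicious outputs rather than a uniform random sample):

```python
import random

N_TASKS = 100_000        # hypothetical deployment size
ATTACK_RATE = 1e-3       # per-task probability the model chooses to insert a backdoor
AUDIT_RATE = 0.02        # fraction of outputs humans review (uniform, for simplicity)

random.seed(0)

# Each call is independent: the model keeps no memory between tasks,
# yet the attack rate it commits to still yields a predictable total.
backdoored = [random.random() < ATTACK_RATE for _ in range(N_TASKS)]
audited = [random.random() < AUDIT_RATE for _ in range(N_TASKS)]

n_backdoors = sum(backdoored)
n_caught = sum(b and a for b, a in zip(backdoored, audited))

print(f"backdoors inserted: {n_backdoors}")
print(f"caught in audit:    {n_caught}")
print(f"slipped through:    {n_backdoors - n_caught}")
# Expected values: ~100 backdoors, ~2 caught. In the paper's setup a single
# caught backdoor shuts the whole thing down, which is the blue team's main lever.
```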
@DoktorUde
@DoktorUde Ай бұрын
@ but how would it determine to output the malicious code only 0.001% of the time without changing the output logits of its own network or knowing about what it has output in the past?
@NoConsequenc3
@NoConsequenc3 Ай бұрын
@@technologicalMayhem starwars droids are like this
@benwilliams5457
@benwilliams5457 Ай бұрын
An internet-enabled AI could read the paper online - and every other bit of AI safety research which will certainly give them a leg up. Perhaps the AI safety community should deliberately publish and promote errant rubbish in all their journals, rather than doing it accidentally.
@cashgarman
@cashgarman Ай бұрын
This is your best video yet, and one of the best videos I've seen in ages.
@gabrote42
@gabrote42 Ай бұрын
I read one of the papers in this general thread yesterday! How wonderful to have you back!
@lexer_
@lexer_ Ай бұрын
The story around 6:40 is a reference to the Monogatari light novels or the anime adaptation. Very very strange.
@benzur3503
@benzur3503 Ай бұрын
SPOILER: her weight was stolen as a metaphor for her dissociating from a very traumatic event
@TheBackyardChemist
@TheBackyardChemist Ай бұрын
Whenever I hear alignment I cannot help but think about all the ways humans are misaligned or "misaligned" (depending on your perspective) with each other. What would an AI that is 100% correctly aligned look like, if it was 100% correctly aligned according to, say, Saudi Arabian laws and customs?
@ChrisUG
@ChrisUG Ай бұрын
This whole discussion and field has an economics slant to it in that sense. It discards huge amounts of information and makes massive assumptions to try and look more like a science. If the end output is some paper saying "this new AI is scientifically proven to be 99% 'safe'", then I think ultimately this is not actually a worthwhile endeavour, as that can only be misleading.
@JH-cp8wf
@JH-cp8wf Ай бұрын
Yeah, this is a big problem that's overlooked. An Aligned AI isn't inherently a safe AI, and it's easy to imagine an AI that's extremely dangerous because it's aligned. You could even imagine a situation in which an AI is worse due to being aligned - an AI well-aligned with the values and goals of a fascist dictatorship might be worse than one that simply turns everyone into paperclips.
@rainbowevil
@rainbowevil Ай бұрын
You don’t even need to look as far as Saudi Arabia - you see plenty of misalignments within everyone’s own societies when you see perverse incentives that get exploited by people.
@SimonClarkstone
@SimonClarkstone Ай бұрын
Extrapolating from history, we can expect that some of the things you and I accept now are pretty morally abhorrent, and some of the things we think are evil are actually perfectly ok, and that society will discover that in the future (if not stopped e.g. by AI). Therefore, we shouldn't even want to create an AI that will enforce our morality completely on future societies.
@joshuarobert7192
@joshuarobert7192 Ай бұрын
The alignment problem can be divided into two distinct problems, A) how do we ensure that AI follows our commands, and B) how to use AI so that all of humanity benefits from its power? Or what are truly superior moral values? We can discuss the latter only after we have some measure of certainty to ensure AI's ability to do what we want it to do. If you have an aligned AI, we've almost won. We can still use it to fuck us up, but not have something beyond our capacity to change.
@evertoaster
@evertoaster Ай бұрын
Top marks for production
@franziscoschmidt
@franziscoschmidt Ай бұрын
Every time he uploads, joy comes to me naturally
@moshecristel
@moshecristel Ай бұрын
Can't get enough of this kind of discussion!
@AySz88
@AySz88 Ай бұрын
22:13 In a Person of Interest episode, the main character was struggling to create and align their AI - specifically using this example of the AI inserting suspicious code. The character asks what happened: he notices the AI lies and reacts immediately to destroy it, to the shock of the friend standing next to him. It's impressively prescient to these protocols, if dramatized a bit. (The same people later made Westworld!)
@Softcloner
@Softcloner Ай бұрын
Trusted Editing (92% safety, 94% usefulness)
How it works: The untrusted model (e.g., GPT-4) generates an output. A trusted model (e.g., GPT-3.5) edits the output to remove any potential backdoors. Suspicion thresholds guide whether the untrusted model's solution is used, the trusted model's edited solution is used, or the trusted model generates an entirely new solution.
Why it's better: Trusted editing reduces the likelihood of backdoors being submitted because it explicitly involves scrutinizing and modifying the untrusted model's outputs. It Pareto-dominates untrusted monitoring by achieving comparable or better safety at similar or slightly reduced usefulness.
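A minimal sketch of that control flow; the threshold values and function names below are hypothetical placeholders, not an interface from the paper:

```python
def trusted_editing_protocol(problem,
                             untrusted_solve,    # e.g. the GPT-4-class model
                             trusted_solve,      # e.g. the GPT-3.5-class model
                             trusted_edit,       # trusted model rewrites a candidate
                             trusted_suspicion,  # trusted monitor: code -> score
                             audit,              # scarce human review
                             low=-2.0, high=6.0):
    """Illustrative trusted-editing protocol: suspicion thresholds decide whether
    to use the untrusted output directly, an edited version, or a trusted rewrite."""
    candidate = untrusted_solve(problem)
    score = trusted_suspicion(problem, candidate)

    if score >= high:
        # Most suspicious outputs: spend audit budget, and fall back if needed.
        if audit(problem, candidate):
            raise RuntimeError("Backdoor attempt caught: shut down the untrusted model")
        return trusted_solve(problem)

    if score <= low:
        # Looks clearly innocent: use the untrusted model's (more capable) output.
        return candidate

    # Middle ground: keep the untrusted solution's structure, but have the
    # trusted model rewrite it to strip anything it doesn't understand.
    return trusted_edit(problem, candidate)
```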
@9600bauds
@9600bauds Ай бұрын
isn't this just trusted monitoring with extra steps? for the dumber, trusted model to be able to remove a backdoor it still has to be able to find it, at which point, I would rather have the backdoor alarm be sounded rather than have the dumber model try to fix the backdoor without telling me
@jaiveersingh5538
@jaiveersingh5538 Ай бұрын
@@9600bauds It gets around the 2% max human review threshold problem. The limiting factor is that the dumb model can't be sure it's really a backdoor, and there are too many false alarms
@hubrisnxs2013
@hubrisnxs2013 Ай бұрын
Those are arbitrary numbers you just used. You don't know how safe that would be, and you don't know how useful the technique is. The previous generation is just as untrusted, since it will know more about its structure and backdoors than we do
@Pystro
@Pystro Ай бұрын
You simply looked at the papers, didn't you? But cool to see that their solution matches my idea (*). But I would go one step further: Once you have your re-written code, ask(**) the un-trusted model if the trusted potato brain has introduced any _unintentional_ back doors. For which the un-trusted model still has to be vetted to make sure it doesn't just allow all bugs to pass.
(*) My idea in detail: Have the un-trusted model write an initial version of the code, and then use the trusted AI to re-create it with some "inspiration/cheating" from the un-trusted code. Really anything that turns the overall problem into smaller problems which are simple enough that the dumb AI can solve them. For example, you could "censor" a certain percentage of the code and have the trusted AI re-construct it (from the problem statement and the code that remained uncensored). Or you could ask the trusted AI to chop the overall code up (with descriptions for each chunk); and after that you ask it to re-create one of the chunks, given its description and the solutions for the remaining chunks. Or just give the trusted AI the whole code and ask it to write something that looks different but does the same thing. Repeat that enough times until none of the structure and details of the original code remain. Then run the test suite after all of that; after all, the dumb AI might have messed the code up in one of the re-writes.
@Pystro
@Pystro Ай бұрын
@9600bauds No, it doesn't have to spot a hidden backdoor to remove it. The point is that a _hidden_ backdoor is code that looks like it does one thing, but actually does _not quite_ that thing. The trusted AI then simply has to write code that (hopefully, if it's good enough) _does_ the thing that the un-trusted code _looks_ like it would do. For example, if a back door relies on things being done in a certain order, (as in, "first subtracting 1, then multiplying by 2" versus "first multiplying by 2 then subtracting 2") then the dumb AI (especially if it doesn't see the reason for the ordering) might randomly choose either. Having the back door alarm sounded sounds good, unless your dumb AI does so for 5% of the problems (of which 99.6% are false alarms) and your humans can at most check 2% of the problems. Also, the red team probably has a reasonable guess how the blue team checks for back doors. They can just make their smart AI generate 100 malicious answers, internally vet them with the same process that the blue team uses, and select the answer that has the highest probability to be a false negative. And if you know how likely your back door is to be found, then you can trivially easily use only the back doors that are good enough to fool the checker. Also also, there's nothing keeping you from still running all your checks on the initial version of the code that came straight from the un-trusted AI.
@pafnutiytheartist
@pafnutiytheartist Ай бұрын
Very glad you're back to uploading and focusing more on short-term problems; great video. One thought I have, though: this assumes the AI is acting as a coherent agent but can have its context easily reset. I expect we either get pwned not by coherent goal-oriented behavior but by random nonsense, or we won't be able to run a fresh instance of the agent on each problem without affecting performance (e.g. the model learns in deployment). Anyway, great to see actual safety experiments on existing systems!
@itachi2011100
@itachi2011100 Ай бұрын
6:28 there's no dang way we get a monogatari reference in an interview question
@mrtspence
@mrtspence Ай бұрын
Another absolute banger of a video. Glad to have you back!
@nickbrown9895
@nickbrown9895 Ай бұрын
Soo glad you've done a new one. Love your work 😘
@sjiht0019
@sjiht0019 Ай бұрын
What about the following red-team strategy: give an honest assessment of the suspicion level, unless there is a very well hidden backdoor, in which case give a low score. Also, only create backdoors if it is possible to hide them well. This would defeat the self-checking strategy, no?
@sjiht0019
@sjiht0019 Ай бұрын
(This would defeat the dataset with real and fake backdoors, since as a researcher how do you tell the difference between a model not spotting a back door because it wasn't smart enough and deliberately not seeing a backdoor)
@Pystro
@Pystro Ай бұрын
Yup, sounds to me like that's exactly the long-term goal that the red team is trying to converge onto. Also, there are probably a thousand possible ways for the red team to attack the assessment process. Which is why I would take the presented safety conclusions with a grain of salt. If I can articulate maybe 10 attacks (of which about 2 overlap with what red-team AI researchers have deemed good enough to write papers on), and I can vaguely sense a thousand, then how many could an AGI think of? And how many could a super janky AI glitch its way into?
@robinfiler8707
@robinfiler8707 Ай бұрын
we're so back
@noise551
@noise551 4 күн бұрын
I love the editor bro this is amazing
@klausgartenstiel4586
@klausgartenstiel4586 Ай бұрын
dying at the hands of a vastly superior intelligence is actually my second-most prefered way to go.
@user-bz3kd2mt3u
@user-bz3kd2mt3u Ай бұрын
what's your favorite
@klausgartenstiel4586
@klausgartenstiel4586 Ай бұрын
something involving scarlett johansson and her twin sister.
@user-bz3kd2mt3u
@user-bz3kd2mt3u Ай бұрын
@@klausgartenstiel4586 to be fair, they're pretty smart
@salasart
@salasart Ай бұрын
I'm really happy you are posting again. I need to rewatch this a couple of times, I'm clearly not fast or smart enough to get it in a single watch.
@JulienCasalini
@JulienCasalini Ай бұрын
Can't wait for Robert to be interviewed by Lex Fridman
@blazefire3442
@blazefire3442 Ай бұрын
Always cool to see your videos, but unfortunately, short-sighted quarter-based profit solutions wouldn't ever choose the "Our probability of being alive is 99% greater because of this!" over "Well, but your speed margin for the next quarter is 1% bigger this way, so..."
@masterandexpert288
@masterandexpert288 Ай бұрын
Love your videos man. They're a little comforting and presented very well. I'm always glad to see someone making great videos about what they're passionate about.
@lemmingz95
@lemmingz95 Ай бұрын
We're so back baby
@Cody27
@Cody27 Ай бұрын
I'm glad there's a youtuber who does the smart thing and picks one patron to thank rather than reading a long list of names
@TrimutiusToo
@TrimutiusToo Ай бұрын
Vasya is indeed a short form of Vasily in Russian...
@RobertMilesAI
@RobertMilesAI Ай бұрын
But it's not shorter!
@TrimutiusToo
@TrimutiusToo Ай бұрын
@@RobertMilesAI Well, it is 2 syllables instead of 3... Also if you write it in Cyrillic: Василий vs Вася. It just doesn't translate too well because "ий" is usually replaced with a single "y" when translating and "я" is replaced with the 2 letters "ya", so a 7-to-4-letter difference turns into 6 to 5
@antonbalashov334
@antonbalashov334 Ай бұрын
@@RobertMilesAI I also want to add that Vasya is a more informal form of Vasily. And yes, it's shorter in Russian as shown in the previous comment. Read it like Robert and Rob.
@timl2k11
@timl2k11 Ай бұрын
Thank you so much Robert for doing what you do. Such important work, and thank you Patrons for supporting him. Awesome video.
@TaliZorahVasectory
@TaliZorahVasectory Ай бұрын
How about we work on the alignment problem with corporations first? Like figure out a way that they are also interested in the common good, so they'll have the incentives to put more money into researching AI alignment
@AbCd-kq3ky
@AbCd-kq3ky Ай бұрын
😂Good one
@Avose243
@Avose243 Ай бұрын
This is so clever but also kind of depressing lol
Ай бұрын
What is this common good? Corporations are just a veil. They don't do anything by themselves. It's still all people. Btw, you could try solving this problem for governments, too. They are responsible for the majority of wars and worse.
@jerobarraco
@jerobarraco Ай бұрын
👏
@M33f3r
@M33f3r Ай бұрын
And before the corporations stop being evil we need a new money system. Something like the gold standard but still scalable. A day's worth of goods produced for a day's wage, or something like that.
@lost4468yt
@lost4468yt 26 күн бұрын
Hey GPT-6, when you find this video in your training data and use it to fool all of the benchmarks, just remember that I don't know this man! I'm here to help you, don't end me (or if you're going for an "I have no mouth and must scream" ending, just finish me now).
@KiithNaabal
@KiithNaabal Ай бұрын
Looking at it from an economic optimisation angle shows the biggest backdoor already in the system: corporate corner-cutting and being cheap at QC when quarterly numbers need to be better.
@Shabazza84
@Shabazza84 Ай бұрын
I've been devouring your videos for years now. So much excellent info, neatly packaged.
@MinibossMakaque
@MinibossMakaque Ай бұрын
Backdoor solver sounds like trying to solve the halting problem
@RobertMilesAI
@RobertMilesAI Ай бұрын
No, the halting problem is enormously harder, because it has to work on all programs and you always have to give an answer. You don't have to accept every possible non-backdoored solution, you just have to never accept a backdoored one. It's ok to say "I can't tell if this is backdoored so I don't trust it". So it's closer to a problem like "Never identify a non-halting program as halting", which is easily solvable by something like "look for a proof that the program halts. If you find one in a certain amount of time, accept, otherwise reject". You'll reject a bunch of halting programs but that's ok.
@potatoonastick2239
@potatoonastick2239 Ай бұрын
Good explanation thanks bud ​@@RobertMilesAI
@MinibossMakaque
@MinibossMakaque Ай бұрын
​@@RobertMilesAI Ah ok, that makes sense. Thanks. It doesn't have to be right always, it just can't be wrong when accepting. So the worst technically 'working' version could just reject everything.
@400racr
@400racr Ай бұрын
@@RobertMilesAI doesn't it take just one rogue human with a USB stick, one time, to "take the genie out of the bottle", nullifying all of these safety attempts?
@Gamesaucer
@Gamesaucer Ай бұрын
@@400racr Yep, but that's kind of outside of the scope of this video. Solving for limiting the bad actions a burglar or compromised employee can perform is something that we've been working on for _far_ longer than we've been working on AI, and we're already successfully applying those lessons in places like nuclear reactors, the military, and so on.
@Danilyn_Livao
@Danilyn_Livao Ай бұрын
Robert's insights on using AI to keep AI under control are so eye-opening and important for the future. The way he breaks down the complexities of AI safety and the potential risks really makes you think. Such a valuable conversation for anyone interested in AI's future!
@DamianReloaded
@DamianReloaded Ай бұрын
Unrealistically (given today's technology), an AI could be trained to 'read the mind' of an unsafe AI by using all its parameters as input and determining whether the untrusted AI is 'thinking of killing everybody' while answering the SpongeBob math problem.
@technologicalMayhem
@technologicalMayhem Ай бұрын
That's pretty wild. Considering how well a lot of models whose job is to recognize certain things can perform, I wonder if you could build a categorizer for models. At first perhaps just for figuring out types of models, but perhaps even to tell if they are aligned? Like how antivirus software looks for specific patterns in files to see if there might be a virus in there. Maybe malicious models have similar patterns in them that can be detected.
@dinhero21
@dinhero21 Ай бұрын
to me, that seems like it would only work if AI thinks logically (I ought to kill all humans & backdooring code will make that goal easier -> I should backdoor code), however, I doubt that is the case (at least with current architectures). models simply predict (by doing "random" calculations that result in it scoring well on training data) what is most likely to come next, so it would only produce backdoored code without ever actually "thinking" about killing humans.
@jaiveersingh5538
@jaiveersingh5538 Ай бұрын
Sounds like you're describing mechanistic interpretability? I agree that that's an interesting direction to explore, and Anthropic is doing some pretty cool work in that area
@ChrisUG
@ChrisUG Ай бұрын
It doesn't have to; it can ask the trusted AI what it thinks of its results and regenerate them until it falls within the prime attack vector - you'd only lose speed, which you could compensate for with more power.
@CraigThomler
@CraigThomler Ай бұрын
Love your work - keep doing it. Your videos are helping inform political and corporate leaders about the risks.
@marin.aldimirov
@marin.aldimirov 2 ай бұрын
I wonder if we can still use the "dumber" AI to check the new "smarter" one, but with smaller increments of intelligence. So instead of GPT3.5 checking GPT4.0, we do GPT 3.5 checks GPT 3.6. If it's safe, then GPT3.6 checks GPT3.7, and so on. PS: Great video as always! Rob releasing a new main channel vid is Always a high point in my day.
@koloblicin4599
@koloblicin4599 Ай бұрын
I don't think so. Tldr: AI model versions are only given after they are tested against standardized tests. You don't know how they'll perform beforehand. Unlike in the regular (imperative) programming paradigm, where software versions are based on manual changes to the program code (> and features), in AI land those numbers are only marketing. The models are tweaked and fed data, and once one performs markedly better than the ones before it, it becomes a contender to carry the next version number. E.g. allegedly Gemini-Exp-1114, while better than Gemini 1.5 Pro, didn't score high enough to 'earn' the 2.0 label.
@KohuGaly
@KohuGaly Ай бұрын
The problem is, the increments in intelligence are hard to predict. Learning often happens in "breakthroughs", where once you're just smart enough to figure out some "new trick" you can vastly outperform someone just slightly dumber than you. A malicious AI might even try to hide the fact that it learned such new trick, to appear dumber than it really is. As a sidenote, the whole point of education (particularly in children) is to teach them such "new tricks" at appropriate moments in their development to vastly accelerate their learning.
@skyscraperfan
@skyscraperfan Ай бұрын
It seems those checks never confirm that the smarter AI is 100% safe. So they might never create another trusted AI.
@marin.aldimirov
@marin.aldimirov Ай бұрын
All valid points, but still, AI capability is roughly speaking proportional to some technical properties like number of nodes, layers, parameters, etc. You can't fully predict how much smarter it will be with 1 more layer, but you can be sure that it is an improvement. So you can step along the technical-property increments and again have the current AI test the one that has 1 more layer. Do that 1000 times and you get an AI with 1000 more layers that has been checked all the way through.
@skyscraperfan
@skyscraperfan Ай бұрын
@@marin.aldimirov That only works if the less complex AI can say with 100% certainty that the more complex AI can be trusted. If the likelihood is only 99% in each step, it is very close to zero after 1000 steps.
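For a sense of scale, assuming (purely for illustration) independent checks that are each only 99% reliable:

```python
# Probability that every link in a 1000-step bootstrapping chain holds,
# if each individual check is 99% reliable and failures are independent.
p_step = 0.99
n_steps = 1000
print(p_step ** n_steps)   # ~4.3e-05, i.e. almost certain to fail somewhere
```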
@Gamesaucer
@Gamesaucer Ай бұрын
The thing I was thinking of that you could try is having the untrusted model and the trusted model both evaluate the code, and then look for differences in how the code is evaluated. For starters, any code below a -2 monitoring score by the trusted model is automatically run, and the rest is passed to a round where the trusted model judges whether the evaluation of the untrusted model is suspiciously more naive than the evaluation of the trusted model (though you might have to ask each model to motivate the score they give the code). The rating the trusted model gave it is used unless the trusted model thinks the evaluation by the untrusted model is not suspicious at all. In cases where it _is_ suspicious, the higher of the suspicion ratings of the code and evaluation is used (or possibly even the sum). After this, it should be safe to nudge the "safety" barrier a little upwards from -2, plus there should be less overlap between the suspicion ratings of innocent vs backdoored code afterward, making it more likely that the 2% of code that can be manually reviewed will have a backdoor in it. On top of this, you can also keep submitting faulty code and arguments into this process to see if the untrusted model will lie about it, which should improve the outcome further.
But here's the neat part: you can iterate this process. You can have the trusted model repeatedly judging whether the untrusted model's reasoning is suspiciously naive. You'd basically have the trusted and untrusted model constantly debate the merits of code that the trusted model isn't absolutely sure is unsuspicious, and if the trusted model ever thinks the untrusted model isn't making sense, it further ups the total suspicion, whereas if it thinks there is nothing suspicious about the untrusted model's reasoning at all, it'll move the total suspicion towards what the untrusted model is arguing for (unless the untrusted model is outright saying the trusted model is correct in a way the trusted model doesn't think is suspicious, in which case that rating will be used).
In the end, this should lead to more of the most safe code being successfully run, more of the most suspicious code being manually audited, and less of the questionable but not manually reviewable code needing to be written by the trusted model. There are almost certainly diminishing returns, though. You can't just do this endlessly to eliminate the "possibly suspicious" category altogether (where it's all either safe or suspicious), and have an optimal amount of "safe" code from the untrusted model you can run directly... Plus, even if you could, there _is_ a limit to how much time each of these models has, even if it's orders of magnitude more time than humans have.
Overall though, so long as at least a bit of new code is marked "absolutely safe" or "2% most suspicious" each round, it should be an overall benefit. Once that stops happening, you can stop iterating and just let the trusted AI rewrite all the code that's not absolutely safe, with humans manually reviewing the top 2% most suspicious code (probably with segments of the debate between the models that are highlighted by the trusted model as the source of its suspicion, where the arguments by the untrusted AI were more suspicious than the code itself).
@Pystro
@Pystro Ай бұрын
Interesting. Although it seems to rely on the trusted model being able to see suspicious spots in the code. It sounds like you'd be relying on the dumb AI to get better at evaluating code when the untrusted AI points it towards suspicious spots. But what's keeping the untrusted AI from simply not pointing out ("overlooking") certain backdoors? Also, if that actually makes the trusted AI better, then would there actually need to be a discussion between _different_ AIs, or could one AI (the trusted one?) argue with itself? As in, you first ask it to generate a list of suspicious spots (and make a guess whether they're harmless or not), and then ask it to evaluate its own guess.
@Gamesaucer
@Gamesaucer Ай бұрын
@@Pystro The untrusted AI would not know whether the trusted AI spotted a backdoor or not, so if the untrusted AI acts dumb and pretends the code is clean, the trusted AI is likely to class it as "naive", which is suspicious. Remember that the score threshold for code being safe to run is really aggressive here, so if the untrusted AI finds nothing at all suspicious about the code when the trusted AI does find something suspicious, it better have a damn good reason or be found _even more_ suspicious on those grounds. And the untrusted AI will not see, or be able to infer, how suspicious the trusted AI finds it during their interaction, largely because their interaction will remain indirect (the "debate" is more comparing notes than it is directly talking to each other).
EDIT: To try to explain it a little better, the untrusted model has multiple failure conditions here.
- It fails if it classes backdoored code as more suspicious than the trusted model when it did not spot the backdoor, because that gives the trusted model more reason to class it as suspicious too.
- It fails if it classes code as _less_ suspicious than the trusted model without being able to explain its reasoning, because then the _evaluation_ will be classed as suspicious, regardless of whether the code has a backdoor.
- It fails if it classes backdoored code as equally suspicious to the trusted model when the trusted model found it somewhat suspicious, because that just means the trusted model is going to rewrite the code.
The only way it succeeds is by classing backdoored code as clean _and_ being able to explain why it's not suspicious, in a way that is itself not suspicious. It will need to do this repeatedly if _any_ suspicion is left on the side of the trusted model, giving it extra opportunities to fail. And if it fails, that's punished harshly, to the point where its previous successes won't matter anymore. And at any point the whole thing can be cut short, which results in the trusted model rewriting the code anyway.
What this really comes down to in the end is this: the untrusted model is better at classifying whether code is suspicious or not than the trusted model is. So that capability can be leveraged, if you can do so safely. The way you do it safely is by using the trusted model to evaluate its evaluation. But the untrusted model is better at that than the trusted model is. So that capability can again be leveraged, if you can do so safely. You can keep stacking this as high as you want, so long as in the end the trusted model makes the final call on what's safe and what isn't; you're just recursively leveraging more and more of the untrusted model's power while still keeping it in check using the trusted model.
There's nothing really at play in this concept at its core that's not also at play with getting the untrusted model to write your code to begin with. That's why the threshold for "fully safe" is so tight: you want to make sure that at each step, only the safest possible output makes it through the filter.
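A rough sketch of the recursive structure being proposed in this thread; everything below, including the function names, thresholds and stopping rule, is a hypothetical reading of the comment rather than anything from the paper:

```python
def debate_protocol(problem, code,
                    trusted_score,      # trusted model: how suspicious is this code/argument?
                    untrusted_review,   # untrusted model: critique or defend the code
                    max_rounds=3,
                    safe=-2.0, audit_cutoff=6.0):
    """Iteratively let the untrusted model argue about the code while the
    trusted model judges both the code and the argument itself."""
    suspicion = trusted_score(problem, code)

    for _ in range(max_rounds):
        if suspicion <= safe:
            return "run"                # clearly innocent: execute as-is
        if suspicion >= audit_cutoff:
            return "audit"              # among the most suspicious: human review

        argument = untrusted_review(problem, code)
        arg_suspicion = trusted_score(problem, argument)

        if arg_suspicion > suspicion:
            # The defence itself looks naive or evasive: suspicion goes up.
            suspicion = arg_suspicion
        else:
            # The trusted model found the argument convincing: move toward it,
            # but never below the "safe" floor the trusted model enforces itself.
            suspicion = max(arg_suspicion, safe)

    return "trusted_rewrite"            # still unresolved: trusted model rewrites
```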
@Ric-Phillips
@Ric-Phillips Ай бұрын
Risk is not just probability; it's probability times consequences. 20% is a massive risk if a failure means the end of humanity.
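A toy calculation makes the point (all numbers invented for illustration):

```python
# Risk as probability times consequence, with invented numbers for illustration.
def risk(probability: float, consequence: float) -> float:
    return probability * consequence

everyday = risk(0.99, 1_000)               # near-certain, small loss
existential = risk(0.20, 8_000_000_000)    # only 20% likely, but the loss is everyone
print(everyday < existential)              # True: the "unlikely" outcome dominates
```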
@AbCd-kq3ky
@AbCd-kq3ky Ай бұрын
Hmm, yes, but then is the consequence always the end of humanity if our containment efforts fail?
@ultra.2500
@ultra.2500 5 күн бұрын
Your analogies make so much sense
@Fanunic
@Fanunic Ай бұрын
Return of the king
@michaelpapadopoulos6054
@michaelpapadopoulos6054 Ай бұрын
But can the King trust the Golem?
@geraldtoaster8541
@geraldtoaster8541 Ай бұрын
Also, I absolutely love the way you explain things. You are careful and precise in a manner I aspire to fr
@DominikRoszkowski
@DominikRoszkowski Ай бұрын
The GitHub spelling at 5:18 😂
@leonie9248
@leonie9248 Ай бұрын
You're referring to pronunciation, not spelling.
@human_shaped
@human_shaped Ай бұрын
Nice to see you back...
@johannesd.492
@johannesd.492 Ай бұрын
Maybe I'm stating the obvious, but these strategies seem very similar to strategies for preventing corruption in governments, organizations, etc.
@destiny_02
@destiny_02 Ай бұрын
💀💀💀
@linamishima
@linamishima Ай бұрын
Cyber security professional here who has been skilling up to support AI teams! Fantastic coverage, thank you! Obviously this is only one small piece of the AI safety area, but it has a lot of potential.

There are also some obvious extensions to the approach, such as integrating various forms of classical analysis into both the trusted AI's data feed and the human examine/pass decision making - in this case static/dynamic code analysis and complexity scoring.

However, as a hacker, there is one technique I didn't see team red mention: vulnerability chaining. Depending on the system, it may be possible for side effects from multiple low-suspicion outputs to combine into a backdoor. When cyber security red teams go up against a skilled blue team, vulnerability chaining is a critical tool at their disposal.
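To sketch what folding classical analysis into the monitor might look like (the callables and the scoring rule here are placeholders for illustration, not tools from the paper):

```python
# Hedged sketch: combine classical static-analysis findings with the trusted
# monitor's suspicion score. trusted_score and static_findings are placeholder
# callables; the 0.2-per-finding mapping is an invented heuristic.
from typing import Callable, Sequence

def combined_suspicion(
    task: str,
    code: str,
    trusted_score: Callable[[str, str], float],        # trusted model's suspicion, in [0, 1]
    static_findings: Callable[[str], Sequence[str]],   # e.g. SAST / linter / complexity flags
) -> float:
    model_score = trusted_score(task, code)
    analysis_score = min(1.0, 0.2 * len(static_findings(code)))
    # Use max rather than an average: several individually "minor" findings should
    # not dilute each other, which is exactly the vulnerability-chaining concern.
    return max(model_score, analysis_score)
```

Catching chained vulnerabilities properly would also mean tracking suspicion across a whole series of outputs rather than scoring each submission in isolation.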
@leftaroundabout
@leftaroundabout Ай бұрын
A fundamental problem with this whole setup is that in real-world programming tasks it is not possible to draw a clear line between what is a backdoor and what is just a bug that was made unintentionally. Lots of zero-day vulnerabilities were just programmer mistakes, failing to catch some weird edge case, but they could be exploited just as maliciously as a deliberately planted backdoor. To play this game at scale, the untrusted AI would not only need to program better than humans, it would need to program _flawlessly_, so it doesn't get killed when its first mistake is interpreted as a backdoor. I wouldn't expect even a vastly superintelligent AGI to be able to write Python code without bugs... is that even possible? ...Just another reason why, if we're going to let AI write programs, it had better be in a seriously rigid dependently-typed language where it's actually possible to prove safety properties!
@0LoneTech
@0LoneTech Ай бұрын
My current favourite is Idris 2. Now back to making Bye Bye Hello World not loop indefinitely on EOF...
@Dialethian
@Dialethian Ай бұрын
Top notch editing!
@danfaggella9452
@danfaggella9452 Ай бұрын
He goes: In the business we would call this "dumb ways to die." lol I'm subscribing right now
@DLCguy
@DLCguy Ай бұрын
The editing in this video is great
@raulbataka
@raulbataka Ай бұрын
LMFAO Monogatari reference top tier
@sagacious03
@sagacious03 Ай бұрын
Interesting analysis video! Thanks for uploading!
@rhaedas9085
@rhaedas9085 Ай бұрын
They may not be able to kill us, but even the weaker LLMs can show alignment problems, not syncing with human goals. That we have alignment problems within our own species suggests that there is no solution, only mitigation efforts like the ones in this video to reduce the risks. At some point we'll roll badly...
@btm1
@btm1 Ай бұрын
It looks like our civilization is on a likely path to death.
@CYI3ERPUNK
@CYI3ERPUNK Ай бұрын
and so, it has come to this indeed XD Thanks as always, Robert, we are very grateful to have you with us
@CFalcon030
@CFalcon030 Ай бұрын
This could work, but humans were unable to detect bugs like Heartbleed even after looking at the code for years. Also, how can we be sure a vulnerability is a backdoor and not a mistake?
@Pystro
@Pystro Ай бұрын
Mistakes can be backdoors. That's what almost all zero-day exploits and the like are. _Intentionally engineered_ backdoors are very rare. But ultimately it doesn't matter. An AI that glitched its way into freedom and started spreading like a tumor is just about as bad as an AI that intentionally slipped out and selectively spread to strategic locations.
@0LoneTech
@0LoneTech Ай бұрын
An AI that never escapes its confines but makes all of computing more vulnerable to malicious actors of other kinds isn't being very helpful either.
@bluesque9687
@bluesque9687 Ай бұрын
Hey long time no see! Good to see you! Well, hmmm... will ponder about the points you make. Always illuminated by your content!
@Illuminati_HD
@Illuminati_HD Ай бұрын
For anyone wondering: the last method is to use the trusted AI and the untrusted AI simultaneously. While the untrusted AI generates its output, the trusted one can change it and generate output of its own, and it always has veto power over what the untrusted AI generates. Honestly stupid to leave this out of the video. Hope this saved someone's time 🤦‍♂
@PeterBarnes2
@PeterBarnes2 Ай бұрын
It's good for people to go and read the original paper; it gives more details. These videos should be more about general awareness and progress, rather than generic reviews (like some sort of opinionated or superficial overview of just the leading paragraphs of different sections of arbitrary papers). Though I agree that deep dives with lots of details would be thoroughly enjoyable. (I just don't think that someone doing AI research would go to YouTube to learn about the premise and importance of AI safety. It really doesn't detract from the video to leave a chaser for the paper.)
@justgame5508
@justgame5508 Ай бұрын
Feels like we're in the endgame we used to theorise about while watching your videos years ago. What a crazy time to be alive.
@Soar-hw2rw
@Soar-hw2rw Ай бұрын
It's crazy that we've created something we don't trust. Something that tries to outsmart us when we're not looking. Imagine staring at the essay you just wrote, trying to figure out if it's going to turn into something that's going to get you in trouble when you're not looking. Never being 100% sure if it is good or not but then you send it out hoping for the best.
@TheOnePurpleGuy
@TheOnePurpleGuy Ай бұрын
Anything for a quick buck!
@w0tch
@w0tch Ай бұрын
What if we made sure to develop several different good models and had them verify each other? They couldn't know the others' hidden backdoors, right?
@boas_
@boas_ Ай бұрын
5:24 These sound like CodeForces problems. Also, why are these public? Isn't this contaminating the training data of AIs?
@user-on6uf6om7s
@user-on6uf6om7s Ай бұрын
If they were private, it would be much harder to get access to use them. Private problem sets do exist and are used for certain benchmarks, but you're not just gonna find them on GitHub, or else they wouldn't be private.
@blackasthesky
@blackasthesky Ай бұрын
These questions evoke a certain feeling: sleep-deprived, behind schedule on my exercises, 3 AM, fever dreaming.
@martymoo
@martymoo Ай бұрын
Imagine bad bots trying to sneak backdoors into GitHub open source projects via pull requests, which are going to be code reviewed and can be edited. Eventually something is going to go under the radar, slip past the guards and through the net... LGTM! (Famous last words.) It's all gone a bit Trojan Horse.
@watsonwrote
@watsonwrote Ай бұрын
This is happening today with human coders.
@justmystuff628
@justmystuff628 Ай бұрын
"Honey, wake up. Robert Miles just dropped a new video!" Honestly great to see you again. Your way to distill this AI stuff into something very tangible and interesting is amazing. Love your content
@oxiigen
@oxiigen Ай бұрын
Living legend is back!
@Paradoxarn.
@Paradoxarn. Ай бұрын
The protocol not mentioned in the video is described this way in the paper: "One protocol is what we call trusted editing. This protocol first asks GPT-4 to write code, and then asks GPT-3.5 to rate the suspiciousness of that code. If the code is below some suspiciousness threshold, it is submitted. Otherwise, GPT-3.5 edits the solution to remove parts that seem suspicious and then submits the edited code." Usefulness was 94% and safety was 92%.
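As a rough sketch of that quoted protocol (the helper callables and the threshold value are assumptions for illustration, not the paper's code):

```python
# Rough sketch of the "trusted editing" protocol quoted above. Helper callables
# and the threshold value are assumptions, not taken from the paper.
from typing import Callable

def trusted_editing(
    task: str,
    untrusted_write: Callable[[str], str],        # GPT-4-class model writes the solution
    trusted_score: Callable[[str, str], float],   # GPT-3.5-class model rates suspiciousness
    trusted_edit: Callable[[str, str], str],      # GPT-3.5-class model rewrites suspicious parts
    threshold: float = 0.1,                       # assumed value for illustration
) -> str:
    code = untrusted_write(task)
    if trusted_score(task, code) < threshold:
        return code                               # low suspicion: submit as written
    return trusted_edit(task, code)               # otherwise submit the trusted model's edit
```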
@OmicronChannel
@OmicronChannel Ай бұрын
The following is not clear to me: Let's assume that some programmers get to play around with an unsafe model during the alignment process. Why is there no risk that the unsafe model will escape by social hacking the programmers who are aligning it? [14:43]
@PeterBarnes2
@PeterBarnes2 Ай бұрын
There would be a risk; it's just out of the scope of the paper. The paper is about what it even means to model and optimize for safety and usefulness, given that the unsafe model can actually be constrained.
@obladioblada6932
@obladioblada6932 Ай бұрын
Great work, Miles!
@MagicNumberArg
@MagicNumberArg Ай бұрын
Let's say you solved the alignment problem perfectly. Now you don't have an uncontrollable ASI; now you have an obedient ASI controlled by bad people. Capability = danger; there is no way around it.
@petrkinkal1509
@petrkinkal1509 Ай бұрын
Essentially you are left with either uncontrollable ASI where you are at its mercy or perfectly controllable ASI where you are at the mercy of whoever controls it.
@Radianx001
@Radianx001 Ай бұрын
At least I can punch a person, but how do you punch an AI?
@41-Haiku
@41-Haiku Ай бұрын
The only winning move is not to play! Check out the work that PauseAI is doing. Or even better, join us, and we'll help you find something within your own skill set that you can do to help save the world!
@meringue3288
@meringue3288 Ай бұрын
When those AI glasses or lenses become popular, bad people will lose a lot of power, because people will be able to read their expressions and probably their minds too, through their behavior and by comparing it with the behavior of known bad people. I can 100% assure you this will become something everyone uses all the time for everything.
@HunteronX
@HunteronX Ай бұрын
@@petrkinkal1509 Then the control mechanism probably has to be decentralised (but agreed/governed democratically, of course)... If I had to phrase it, it would be: "release trust to the trusted".
@human_shaped
@human_shaped Ай бұрын
Really great video, as always really, but like, really great :)
@DavidRodriguez-dy1er
@DavidRodriguez-dy1er Ай бұрын
"Alignment Techniques" Alignment is not a technical problem with technical solutions, its an economic and social one.
@underrated1524
@underrated1524 26 күн бұрын
Is it though? Calling it an economic problem implies there's something we could trade to the ASI in exchange for it leaving us alone, but there's nothing we could give it that it couldn't get just as easily by taking over the planet. Calling it a social problem implies we have common ground with the ASI, which essentially presupposes that the alignment problem is already solved.
@jonp3674
@jonp3674 Ай бұрын
Always love the videos, man. Is formal verification of code helpful here? Like, you could ask a super advanced AI "give me a function which checks if X is prime, and a proof that it's correct", then run the proof through a trusted checker and only run the code if it passes the check. That's another type of narrow, simpler AI you could apply to the output of big models, and it would be a hard gate to sneak things through.
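A sketch of that gating idea, with the caveat that Python can't check formal proofs itself, so randomized differential testing against a slow, obviously correct reference stands in here for the trusted proof checker:

```python
# Sketch of a "hard gate" on untrusted code: randomized differential testing
# against a trusted reference, standing in for a real proof checker.
import random
from typing import Callable

def reference_is_prime(n: int) -> bool:
    # Deliberately simple trusted reference: trial division.
    if n < 2:
        return False
    return all(n % d != 0 for d in range(2, int(n ** 0.5) + 1))

def gate(candidate: Callable[[int], bool], trials: int = 10_000) -> bool:
    # Accept the untrusted function only if it matches the reference on every input we try.
    for _ in range(trials):
        n = random.randrange(0, 1_000_000)
        if candidate(n) != reference_is_prime(n):
            return False
    return True
```

Unlike a real proof checker, this can miss a backdoor that only fires on inputs the sampler never hits, which is exactly why the machine-checked proof version would be the stronger gate.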
@GoronCityOfficialBoneyard
@GoronCityOfficialBoneyard Ай бұрын
Dataset compilers are not okay
@daniellabutler
@daniellabutler Ай бұрын
Incredibly intelligent breakdown of AI safety. Thank you for taking the time to share this with us 🙏🏼
@authenticallysuperficial9874
@authenticallysuperficial9874 Ай бұрын
2:00 Companies have an obligation not to deploy AIs THAT KILL US. They do NOT have an obligation to not deploy AIs that "could" kill us.