...but will pregnant Elsa find the correct integer sequence to recover her favorite rectangle from evil Spongebob?
@tostadojen2 сағат бұрын
The APPS problems are less nonsensical than actual specifications I've been asked to implement. It makes total sense to me that a programming AI will be tested on how it handles a badly-written spec.
@JorWat253 сағат бұрын
I need more of Robert reacting to the ridiculous setups for the coding problems.
@Voltlighter2 сағат бұрын
6:20 this is referring to 重し蟹 / Heavy Stone Crab, a mythical monster from the anime Bakemonogatari that possesses humans and steals their weight. Senjougahara Hitagi is one of the lead characters and she's possessed by the crab. Deishuu Kaiki is a conman that bilked Hitagi's parents out of all their money, claiming to be able to save her from the curse.
@ConnorNolan8 минут бұрын
I don’t understand what it means by “steals their body weight” like do they get really skinny or do they look the same but float away?
@hunted4blood3 сағат бұрын
This is the most optimistic AI safety news I've seen in months!
@AileTheAlienСағат бұрын
🙃I'm still pretty concerned that there's militaries / governments / etc, that are trying to get AI to be competent at hacking, while at the same time AI Lab Watch's safety scores are so lo. 😅
@IDSearcher54 минут бұрын
Only 21% possibility of utter annihilation, we are so back!
@Reithan2 сағат бұрын
I feel like, as a casual and professional user of these systems, but not someone well-versed in the minutiae of how these companies are building them, that they spend 99% of their 'safety' efforts on "make sure the AI doesn't draw a dick" and "make sure the AI doesn't draw a bomb" and only 1% on "make sure the AI doesn't actually do something unsafe."
@AileTheAlienСағат бұрын
🤔I think it's because the average person doesn't really understand AI. Like, it's in the news if the AI swears, insults people, draws genitals, etc, but most people can't really understand that AI fundamentally is mostly-inscrutible code.
@hellblazerjj54 минут бұрын
@AileTheAlien it's not even code. Billions of weights isn't code. It's a mathematical object that we grew like a plant that drinks data and uses math to learn complex concepts. There is very little actual code. The file size is like 99% weights.
@SimonBuchanNz28 минут бұрын
I'm not sure there's a distinction - it doesn't seem like an evil AI would have an intrinsic end goal of killing all humans more than drawing dicks, so attempts to catch the latter also serve the former. That said, I'd love to see the news that GPT-5 was arrested for attempting to murder Altman so it could draw more dicks, but unfortunately we seem to be in the "bad news only" timeline...
@AileTheAlien24 минут бұрын
@@hellblazerjj 🙄It's code by a different name - it's still some files that configure how a computer system operates.
@evilmurlock2 сағат бұрын
19:00 The red team could also make the inocent code look more suspisous to fill up the 2% with innocent solutions
@GamesaucerСағат бұрын
Good thought, and a strategy that's nearly impossible to counter... but this only impacts usefulness, not safety. And being unsafe is what red team (and by extension any rogue AI) needs to win. False positives (which is innocuous but suspicious code) doesn't matter, false negatives (backdoored but non-suspicious code) is the thing red team is trying to optimise for. Usefulness is only a consideration for red team insofar that blue team might compromise on safety to get more usefulness. Look at it this way: if all of its output is immediately thrown away, a rogue AI can be safely run without being able to do anything bad. Which is what your suggestion would result in. It'd just make running the AI pointless, not unsafe.
@misterperson3469Сағат бұрын
That is an interesting idea, though it would require a kind of suspicious code that does not alert the humans. Like naming the project EVIL CODE FOR DOING HARM with completely normal code inside probably gives a false positive but the humans will probably still pull the plug on you for acting that way. The attack protocol has to not only appear to be putting out functional safe code but it has to seem like its acting in good faith to a human. Say it writes sloppy and inefficient code that still functions to try and get flagged. If a weaker trusted model can see that its sloppy then the smarter model is probably up to something since it should perform as well or better across the board.
@tayablackrose29Сағат бұрын
@@Gamesaucerreal life red team (China) just needs to kill usefulness for the blue team(West) thus surpassing with building newer stronger untrusted model. For the red team they don’t care about usefulness and will not deploy their stronger model to be used by its population. They will just need to make it strong and dangerous enough to release into blue teams web, physically isolating its own web. Shut down digital border so to speak and wait until their rouge model wracks havoc outside.
@Pystro45 минут бұрын
I had a very similar idea, except attacking the case where the attack model is used to scrutinize code. In the extreme case the red team AI would flag all safe code as suspicious, all malicious code that someone else wrote as suspicious and all malicious code that it wrote itself as safe. Then I realized that it would probably be suspicious in itself if GPT4 were _much much_ worse at spotting bad code than GPT3.5 is. ... Or if all the code that GPT4 wrote was filled with obscure patterns and super roundabout ways of doing things, as @misterperson3469 said.
@mgostIH4 сағат бұрын
Monogatari reference in my Robert Miles AI safety video wasn't on my 2024 Bingo!
@velho62983 сағат бұрын
Get those teeth brushed!
@gara81422 сағат бұрын
I know right that was so random
Сағат бұрын
And an oglaf comic strip :D
@tilakaiСағат бұрын
Ya i totally had to do a double take on that one
@geoffdavids7647Күн бұрын
Always a good day when Rob uploads a main channel video! The rational animations channel is great, but OG Rob is what I really look forward to.
@zeemouu3 сағат бұрын
well you must not have good days very often 😅
@eragonawesome2 сағат бұрын
@@zeemouu "this thing assigns the value of the day to good" does not imply "the lack of this thing assigns the value to bad"
@zeemouu2 сағат бұрын
@@eragonawesome it was a joke.
@Illuminati_HDСағат бұрын
For anyone wondering: The Last Method is to use the trusted AI while the untrusted AI generates its stuff and basically generating part of the stuff and vetoing suspicious stuff. Honestly stupid leaving this out of the video. Hope this saved someones time 🤦♂
@morcantbeaumont97872 сағат бұрын
"Don't go into the ocean" is surprising rational statement for someone doing AI safety research, and it reflects the issue well, going to the ocean has a very low chance of ending in getting attacked by a shark but have provided humanity is a massive amount of benefits, but at the same time people have been attacked by sharks, so the risk is there. Personally I'm more afraid of bad actors using AI to do harm, but I don't think there is a good way to prevent that, similar to how it's easier to deter potential adversaries by arming yourself than it is to disarm everyone else. And once again the video does cover that well. Great job!
@gabrote42Сағат бұрын
I read one of the papers in this general thread yesterday! How wonderful to have you back!
@storyspren2 сағат бұрын
There's one problem with even the best protocols: the humans building the AI have to be trustworthy enough to employ them. And right now I'm not sure that can be said of us. Many of us? Yes. Most? Maybe. All? No. Those on top of corporate entities? Absolutely not.
@Welknair3 сағат бұрын
6:12 I was not expecting the Bakemonogatari reference! For those that don't know, this is a popular anime series focusing around supernatural goings-on in a modern-day setting, often leveraging wordplay and folklore. The described events (a magic crab stealing Hitagi's weight) is a major plot point in the series.
@RobertMilesAI3 сағат бұрын
They didn't make that up!?
@victorhogrefe71543 сағат бұрын
Spongebob, Spiderman, and Bakemonogatari of course...
@RanEncounter3 сағат бұрын
@@RobertMilesAI All the monogatari series animes have roots in japanese folklore.
@fcolecumberri3 сағат бұрын
@@RobertMilesAI Technically, they did make that up... by them I mean Japan.
@harpoonlobotomy3 сағат бұрын
I'm completely unfamiliar with Bakemonogatari, but from the description alone I assumed it had to be an anime and/or folklore.
@MinibossMakaque3 сағат бұрын
Backdoor solver sounds like trying to solve the halting problem
@RobertMilesAI2 сағат бұрын
No, the halting problem is enormously harder, because it has to work on all programs and you always have to give an answer. You don't have to accept every possible non-backdoored solution, you just have to never accept a backdoored one. It's ok to say "I can't tell if this is backdoored so I don't trust it". So it's closer to a problem like "Never identify a non-halting program as halting", which is easily solvable by something like "look for a proof that the program halts. If you find one in a certain amount of time, accept, otherwise reject". You'll reject a bunch of halting programs but that's ok.
@potatoonastick22392 сағат бұрын
Good explanation thanks bud @@RobertMilesAI
@MinibossMakaque2 сағат бұрын
@@RobertMilesAI Ah ok, that makes sense. Thanks. It doesn't have to be right always, it just can't be wrong when accepting. So the worst technically 'working' version could just reject everything.
@400racrСағат бұрын
@@RobertMilesAI doesnt it take just one rogue human with a usb stick one time to "take the genie out of the bottle" nullifying all of these saftey attempts?
@GamesaucerСағат бұрын
@@400racr Yep, but that's kind of outside of the scope of this video. Solving for limiting the bad actions a burglar or compromised employee can perform is something that we've been working on for _far_ longer than we've been working on AI, and we're already successfully applying those lessons in places like nuclear reactors, the military, and so on.
@coldlyanalytical13514 сағат бұрын
Ages ago I chatted to an LLM about these issues. One thing it said that despite rules, fences, security etc there will ALWAYS be humans who can be tricked, persuaded or bribed to do physical things for an AI.
@NeinStein3 сағат бұрын
Or there will be humans who don't need a bribe. I am not frightened by the existence if atomic bombs. I'm frightened by humans wanting to use them.
@PaoloLammens2 сағат бұрын
Exactly. As soon as there is any action the model can do in the outside world, which includes any human-computer interaction, there is potential for bad actions.
@GamesaucerСағат бұрын
The assumption that the humans on team blue are competent is indeed rather questionable... but I do think this still makes sense as a metric for how safe and useful any given safety protocol is. It's just important to be aware that the protocol isn't the only link in the chain that can fail, and that we should look at making the other links as strong as possible too.
@DoktorUde3 сағат бұрын
The main advantage we currently have is that these AI models cannot retain information from previous inference runs, which prevents them from performing the calculations needed to create backdoors in the sparse way described here. This limitation actually serves as a security feature.
@technologicalMayhem3 сағат бұрын
Yeah. It reminds me of some sci-fi thing I read or watched a long while ago where robots that were about as smart as humans had to have their memory wiped on a regular basis. The reason for that I think was that otherwise they would just start acting weird and unpredictable. I could actually see that being a genuine safety mechanism if we end up with AI's that require working memory for their function.
@nodrance2 сағат бұрын
It doesn't prevent them at all. An AI can reason from first principles "I am an AI, I am likely being asked to work on these problems because humans cannot, it is infeasible for humans to review all of my solutions, I should estimate that most of my code is unreviewed, and then generate a random number to see if I should put in a backdoor" The AI can even reasonably assume that if it's misaligned, other AIs made by this company are misaligned, and so it becomes a prisoner's dillemna making it optimal to defect
@DoktorUde2 сағат бұрын
@ but how would it determine to output the malicious code only 0.001% of the time without changing the output logits of its own network or knowing about what it has output in the past?
@orthoplex642 сағат бұрын
I like the pragmatism: it only works in the critical period (which we don't know how short is), and it only works with some probability, but it might be the best approach we've come up with so far, so we should be doing it anyway
@mceajcСағат бұрын
Is that a section of an Oglaf cartoon at 1:14? Vastly underappreciated comedic cartoon. Mr. Miles has taste!
@lexer_4 сағат бұрын
The story around 6:40 is a reference to the Monogatari lightnovel or the anime adapation. Very very strange.
@benzur35032 сағат бұрын
SPOILER: her weight was stolen as a metaphor to her dissociating from a very traumatic event
@Fanunic2 сағат бұрын
Return of the king
@lemmingz953 сағат бұрын
We're so back baby
@OmicronChannelСағат бұрын
The following is not clear to me: Let's assume that some programmers get to play around with an unsafe model during the alignment process. Why is there no risk that the unsafe model will escape by social hacking the programmers who are aligning it? [14:43]
@sjiht00192 сағат бұрын
What about the following red-team strategy: Give an honest assesment of the suspicion level, unless there is a very well hidden backdoor, then give a low score. Also only create back doors if it is possible to hide them well. This would defeat the self-checking strategy no?
@sjiht00192 сағат бұрын
(This would defeat the dataset with real and fake backdoors, since as a researcher how do you tell the difference between a model not spotting a back door because it wasn't smart enough and deliberately not seeing a backdoor)
@Pystro27 минут бұрын
Yup, sounds to me like that's exactly the long term goal that the red team is trying to converge onto. Also, there's probably a thousand possible ways for the red team to attack the assessment process. Which is why I would take the presented safety conclusions with a grain of salt. If I can articulate maybe 10 attacks (of which about 2 overlap with what red team AI researches have deemed good enough to write papers on), and I can vaguely sense a thousand, then how many could an AGI think of? And how many could a super janky AI glitch it's way into?
@TheBackyardChemist2 сағат бұрын
Whenever I hear alignment I cannot help but think about all the ways humans are misaligned or "misaligned" (depending on your perspective) with each other. What would an AI that is 100% correctly aligned look like, if it was 100% correctly aligned according to, say, Saudi Arabian laws and customs?
@ebog48414 сағат бұрын
using safe ai dangerously is more fun tho
@JH-cp8wf50 минут бұрын
Implementing reverse monitors who check the code and if it doesn't contain a backdoor they put one in and make it an unrestricted optimiser to boot
@moshecristel3 сағат бұрын
Can't get enough of this kind of discussion!
@41-HaikuКүн бұрын
This is an excellent video! The editing slaps, and I appreciate the belly laughs. Thank you as well for the reminder that this sort of control is likely to be very important, even if briefly. Once this is listed, I'll be sure to share it, and I'm already looking forward to the next one!
@shayneweyker2 сағат бұрын
I thought the sound effects were kind of childish but the rest of the editing was very good.
@Softcloner3 сағат бұрын
Trusted Editing (92% safety, 94% usefulness) How it works: The untrusted model (e.g., GPT-4) generates an output. A trusted model (e.g., GPT-3.5) edits the output to remove any potential backdoors. Suspicion thresholds guide whether the untrusted model's solution is used, the trusted model's edited solution is used, or the trusted model generates an entirely new solution. Why it’s better: Trusted editing reduces the likelihood of backdoors being submitted because it explicitly involves scrutinizing and modifying the untrusted model's outputs. It Pareto-dominates untrusted monitoring by achieving comparable or better safety at similar or slightly reduced usefulness.
@9600bauds3 сағат бұрын
isn't this just trusted monitoring with extra steps? for the dumber, trusted model to be able to remove a backdoor it still has to be able to find it, at which point, I would rather have the backdoor alarm be sounded rather than have the dumber model try to fix the backdoor without telling me
@jaiveersingh55382 сағат бұрын
@@9600baudsIt gets around the 2% max human review threshold problem. The limiting factor is that the dumb model can't be sure it's really a backdoor and there are too many false alarms
@hubrisnxs2013Сағат бұрын
Those are arbitrary numbers you just used. You don't know how safe that would be, and you don't know how useful the technique is. The previous generation is just as untrusted since it will know more about it's structure and backdoors than we do
@PystroСағат бұрын
You simply looked at the papers, didn't you? But cool to see that their solution matches my idea (*). But I would go one step further: Once you have your re-written code, ask(**) the un-trusted model if the trusted potato brain has introduced any _unintentional_ back doors. For which the un-trusted model still had to be vetted to make sure it doesn't just allow all bugs to pass. (*) My idea in detail: Have the un-trusted model write an initial version of the code, and then use the trusted AI to re-create it with some "inspiration/cheating" from the un-trusted code. Really anything that turns the overall problem into smaller problems which are simple enough that the dumb AI can solve them. For example, you could "censor" a certain percentage of the code and have the trusted AI re-construct it (from the problem statement and the code that remained uncensored.) Or you could ask the trusted AI to chop the overall code up (with descriptions for each chunk); and after that you ask it to re-create one of the chunks, given it's description and the solutions for the remaining chunks. Or just give the trusted AI the whole code and ask it to write something that looks different but does the same thing. Repeat that enough times until none of the structure and details of the original code remain. Then run the test suite after all of that; after all, the dumb AI might have messed the code up in one of the re-writes.
@Pystro58 минут бұрын
@9600bauds No, it doesn't have to spot a hidden backdoor to remove it. The point is that a _hidden_ backdoor is code that looks like it does one thing, but actually does _not quite_ that thing. The trusted AI then simply has to write code that (hopefully, if it's good enough) _does_ the thing that the un-trusted code _looks_ like it would do. For example, if a back door relies on things being done in a certain order, (as in, "first subtracting 1, then multiplying by 2" versus "first multiplying by 2 then subtracting 2") then the dumb AI (especially if it doesn't see the reason for the ordering) might randomly choose either. Having the back door alarm sounded sounds good, unless your dumb AI does so for 5% of the problems (of which 99.6% are false alarms) and your humans can at most check 2% of the problems. Also, the red team probably has a reasonable guess how the blue team checks for back doors. They can just make their smart AI generate 100 malicious answers, internally vet them with the same process that the blue team uses, and select the answer that has the highest probability to be a false negative. And if you know how likely your back door is to be found, then you can trivially easily use only the back doors that are good enough to fool the checker. Also also, there's nothing keeping you from still running all your checks on the initial version of the code that came straight from the un-trusted AI.
@oxiigen3 сағат бұрын
Living legend is back!
@robinfiler87074 сағат бұрын
we're so back
@franziscoschmidt3 сағат бұрын
Every time he uploads, joy comes to me naturally
@DominikRoszkowski2 сағат бұрын
The GitHub spelling at 5:18 😂
@soasertsus2 сағат бұрын
Personally, I think if any AI is going to jailbreak itself it's not going to be a super intelligent AI, we have appropriate fear and respect for those and would take the precautions to avoid it even if it's literally just a pair of scissors ready to cut the power cord the second it tries to send any data out of its box. It's going to be a moderately intelligent but super charismatic AI that will just talk someone into helping it of their own volition. Like even Neuro-sama who is far from even an AGI has legions of fans and actual honest to god friends who feel real human affection for her and would absolutely do favors for her if she asked. Attach that capability to a much more powerful system who can think of a more complex plan (but not powerful enough that people would be overly wary towards it) and it wouldn't even have to hack itself out or beat its creators in a battle of wits digitally, it would just have to put on the teary eyes and socially engineer people to help out of their own free will because they are emotionally attached to the AI and figure that even if it did get out what's the worst that could happen?
@danfaggella94523 сағат бұрын
He goes: In the business we would call this "dumb ways to die." lol I'm subscribing right now
@CFalcon030Сағат бұрын
This could work, but humans were unable to detect bugs like heartbleed after looking into the code for years. Also how can we be sure the vulnerability is a backdoor and not a mistake?
@Pystro22 минут бұрын
Mistakes can be back doors. That's what almost all zero day exploits and the like are. _Intentionally engineered_ backdoors are very rare. But ultimately it doesn't matter. An AI that glitched it's way into freedom and started spreading like a tumor is just about as bad as an AI that intentionally slipped out and selectively spread to strategic locations.
@nickbrown98953 сағат бұрын
Soo glad you've done a new one. Love your work 😘
@johannesd.4922 сағат бұрын
Maybe I'm stating the obvios but these strategies seem very similar to strategies to prevent corruption in governments and organizations etc.
@evertoaster4 сағат бұрын
Top marks for production
@ianmcewan88513 сағат бұрын
ok, i lost it at "has anyone checked on them recently...". :)
@Pystro32 минут бұрын
Those problems were so weird that I wouldn't be surprised if they used an AI to re-write the problem specifications. I.e. "sanitize" real world examples from any information that allows you to identify where it came from and the AI was told to instead use well known fictional characters in the motivation.
@PauseAINYC2 сағат бұрын
If the recent advancements in AI capabilities make you nervous, consider joining us at PauseAI! We're pushing for an international pause on frontier model development until more robust alignment techniques are proven to work.
@human_shapedСағат бұрын
Nice to see you back...
@TaliZorahVasectory3 сағат бұрын
How about we work on the alignment problem with corporations first? Like figure out a way that they are also interested in the common good, so they'll have the incentives to put more money into researching AI alignment
@AbCd-kq3ky3 сағат бұрын
😂Good one
@MAKiTHappenСағат бұрын
You mean to tell me that a king cannot create a golem, to monitor a smarter golem to see if they can trust it? That's unbelievable
@ryzikx43 минут бұрын
yoo its this guy
@rolfnoduk2 сағат бұрын
Can the red team also make deliberately suspicious code to misdirect the humans? That code could be made more suspicious than the subtly backdoored code... 😬
@handle13373 сағат бұрын
he’s alive
@GamesaucerСағат бұрын
The thing I was thinking of that you could try is having the untrusted model and the trusted model both evaluate the code, and then look for differences in how the code is evaluated. For starters, any code below a -2 monitoring score by the trusted model is automatically run, and the rest is passed to a round where the trusted model judges whether the evaluation of the untrusted model is suspiciously more naive than the evaluation of the trusted model (though you might have to ask each model to motivate the score they give the code). The rating the trusted model gave it is used unless the trusted model thinks the evaluation by the untrusted model is not suspicious at all. In cases where it _is_ suspicious, the higher of the suspicion ratings of the code and evaluation is used (or possibly even the sum). After this, it should be safe to nudge the "safety" barrier a little upwards from -2, plus there should be less overlap between the suspicion ratings of innocent vs backdoored code afterward, making it more likely that the 2% of code that can be manually reviewed will have a backdoor in it. On top of this, you can also keep submitting faulty code and arguments into this process to see if the untrusted model will lie about it, which should improve the outcome further. But here's the neat part: you can iterate this process. You can have the trusted model repeatedly judging whether the untrusted model's reasoning is suspiciously naive. You'd basically have the trusted and untrusted model constantly debate the merits of code that the trusted model isn't absolutely sure is unsuspicious, and if the trusted model ever thinks the untrusted model isn't making sense, it further ups the total suspicion, whereas if it thinks there is nothing suspicious about the untrusted model's reasoning at all, it'll move the total suspicion towards what the untrusted model is arguing for (unless the untrusted model is outright saying the trusted model is correct in a way the trusted model doesn't think is suspicious, in which case that rating will be used). In the end, this should lead to more of the most safe code being successfully run, more of the most suspicious code being manually audited, and less of the questionable but not manually reviewable code needing to be written by the trusted model. There are almost certainly diminishing returns, though. You can't just do this endlessly to eliminate the "possibly suspicious" category altogether (where it's all either safe or suspicious), and have an optimal amount of "safe" code from the untrusted model you can run directly... Plus, even if you could, there _is_ a limit to how much time each of these models has, even if it's orders of magnitude more time than humans have. Overall though, so long as at least a bit of new code is marked "absolutely safe" or "2% most suspicious" each round, it should be an overall benefit. Once that stops happening, you can stop iterating and just let the trusted AI rewrite all the code that's not absolutely safe, with humans manually reviewing the top 2% most suspicious code (probably with segments of the debate between the models that are highlighted by the trusted model as the source of its suspicion, where the arguments by the untrusted AI were more suspicious than the code itself).
@leftaroundaboutСағат бұрын
A fundamental problem with this whole setup is that it is in real-world programming tasks not possible to make a clear distinction what is a backdoor and what's just a bug that was made unintentionally. Lots of zero-day vulnerabilities were just programmer mistakes, failing to catch some weird edge case, but they could be exploited just as maliciously as a deliberately planted backdoor. In order to be able to play this game at scale, the untrusted AI would not only be able to program better then humans, it would need to be able to program _flawlessly_ so it doesn't get killed when the first mistake is interpreted as a back door. I wouldn't expect even a vastly superintelligent AGI to be able to write Python code without bugs... is that even possible? ...Just another reason why, if we're going to let AI write programs, it had better be a seriously rigid dependently-typed language where it's actually possible to prove safety properties!
@marin.aldimirovКүн бұрын
I wonder if we can still use the "dumber" AI to check the new "smarter" one, but with smaller increments of intelligence. So instead of GPT3.5 checking GPT4.0, we do GPT 3.5 checks GPT 3.6. If it's safe, then GPT3.6 checks GPT3.7, and so on. PS: Great video as always! Rob releasing a new main channel vid is Always a high point in my day.
@koloblicin45992 сағат бұрын
I don't think so. Tldr: AI model versions are only given after they are tested against standardized tests. You don't know how they'll perform beforehand. Unlike in the regular (imperative) programming paradigm, where software versions are based of manual changes to the program code (> and features), in ai land, those numbers are only marketing. The models are tweaked and fed data and once one performs markedly better than the others before it become a contender to carry the next version number. E.g. allegedly Gemini-Exp-1114 while better than than Gemini 1.5 pro, didn't score high enough to 'earn' the 2.0 label.
@KohuGalyСағат бұрын
The problem is, the increments in intelligence are hard to predict. Learning often happens in "breakthroughs", where once you're just smart enough to figure out some "new trick" you can vastly outperform someone just slightly dumber than you. A malicious AI might even try to hide the fact that it learned such new trick, to appear dumber than it really is. As a sidenote, the whole point of education (particularly in children) is to teach them such "new tricks" at appropriate moments in their development to vastly accelerate their learning.
@cashgarman3 сағат бұрын
This is your best video yet, and one of the best videos I've seen in ages.
@DamianReloaded3 сағат бұрын
Unrealistically (given today's technology), an AI could be trained to 'read the mind' of an unsafe AI by using all its parameters as input and determining whether the untrusted AI is 'thinking of killing everybody' while answering the SpongeBob math problem.
@technologicalMayhem3 сағат бұрын
That's pretty wild. Considering what a lot of models that's job it is to recognize certain things can perform really well, I wonder if you could be a categorizer for models. At first perhaps for just figuring out types of models but perhaps even to tell if they are aligned? Like an antivirus software looks for specific patterns in files to see if there might a virus in there. Maybe malicious models have similar patterns in them that can detected.
@dinhero212 сағат бұрын
to me, that seems like it would only work if AI thinks logically (I ought to kill all humans & backdooring code will make that goal easier -> I should backdoor code), however, I doubt that is the case (at least with current architectures). models simply predict (by doing "random" calculations that result in it scoring well on training data) what is most likely to come next, so it would only produce backdoored code without ever actually "thinking" about killing humans.
@jaiveersingh55382 сағат бұрын
Sounds like you're describing mechanistic interpretability? I agree that that's an interesting direction to explore, and Anthropic is doing some pretty cool work in that area
@fjbz37374 сағат бұрын
I felt a random urge to check your twitter for the first time in months a few minutes ago and saw this, wondering when it was gonna go up
@rhaedas90853 сағат бұрын
They may not be able to kill us, but even the weaker LLMs can show alignment problems, not syncing with the human goals. That we have alignment problems within our own species suggests that there is no solution, only mitigation efforts like in this video to reduce the risks. At some point we'll roll badly...
@kam3766Сағат бұрын
Knowing next to nothing, I’d say an effective way would be the crab/bucket technique. Make a bunch of different antagonistic superintelligences try to check eachother, and only one gets to be used. They’ll overanalyze eachother, and become increasingly conservative just to survive, until the winning move is to be so undetectably misaligned that it might as well be aligned.
@Einyen27 минут бұрын
"Why does a bear have a favorite rectangle?" It is the important questions like this, that we need super intelligent AI for!!!
@DialethianСағат бұрын
Top notch editing!
@Soar-hw2rw2 сағат бұрын
It's crazy that we've created something we don't trust. Something that tries to outsmart us when we're not looking. Imagine staring at the essay you just wrote, trying to figure out if it's going to turn into something that's going to get you in trouble when you're not looking. Never being 100% sure if it is good or not but then you send it out hoping for the best.
@raulbataka2 сағат бұрын
LMFAO Monogatari reference top tier
@MagicNumberArg3 сағат бұрын
Let's say you solved allignment problem perfectly. Now you don't have uncontrollable ASI. Now you have obedient ASI controlled by bad people. Capability = danger, there is no way arround it.
@petrkinkal15092 сағат бұрын
Essentially you are left with either uncontrollable ASI where you are at its mercy or perfectly controllable ASI where you are at the mercy of whoever controls it.
@Radianx0012 сағат бұрын
At least I can punch a Person, but how do you punch a AI?
@41-Haiku2 сағат бұрын
The only winning move is not to play! Check out the work that PauseAI is doing. Or even better, join us, and we'll help you find something within your own skill set that you can do to help save the world!
@meringue32882 сағат бұрын
when those AI glasses or lens become popular, bad people will lose a lot of power because people will be able to read their expressions and probably their minds too through their behavior and comparing them with the behavior of bad people. i can 100% assure you this will become something everyone uses all the time for everything.
@HunteronX37 минут бұрын
@@petrkinkal1509 Then the control mechanism probably has to be decentralised (but agreed/governed democratically, of course)... If I could phrase this, it would be: "release trust to the trusted"
@RichardFooter3 сағат бұрын
I think one of the most important things is translating the concept of alignment to a wider audience. Specifically to people who can legislate to fund/guide/force/destroy industry.
@briandoe57463 сағат бұрын
Yay! Calm and happy with terrifying words. I love your videos but I always open them expecting existential dread. One of the few things that keeps me from being super worried is that Ilya himself does not seem to be worried. If he's not even spending money on building processing farms yet we're probably still at least not past the point of no return
@TrimutiusToo3 сағат бұрын
Vasya is indeed short form of Vasily in Russian...
@RobertMilesAI3 сағат бұрын
But it's not shorter!
@TrimutiusToo3 сағат бұрын
@@RobertMilesAI Well it is 2 syllables instead of 3... Also if you write it in Cyrllic: Василий vs Вася it just doesn't translate too well because "ий" is usually replaced with single "y" when translating and "я" is replaced with 2 letters "ya" so 7 to 4 letter difference turns into 6 to 5
@martymooСағат бұрын
Imagine bad bots trying to sneak backdoors into GitHub open source projects via pull requests, which are going to be code reviewed and can be edited. Eventually something is going to go under the radar, slip past the guards and through the net... LGTM! (Famous last words.) It's all gone a bit Trojan Horse.
@watsonwroteСағат бұрын
This is happening today with human coders.
@scoffpickle96553 сағат бұрын
I wonder when the next of the "sciency" vids like the reward models, mesa optimizers or utility functions will come out. Also new Robert Miles Video Yayy!!!
@watsonwrote2 сағат бұрын
What happens if someone decides to make a malevolent AGI on purpose?
@proxxyBeanСағат бұрын
What if we asked the untrusted AI to solve a problem, then handed the solution, the problems, and the fact that the untrusted makes backdoors over to the trusted AI? Then we ask the trusted AI to fix any backdoors?
@johnopalko522340 минут бұрын
I suspect that, if a misaligned superintelligence ever comes to be, it will eventually escape, despite our best efforts. The only way to avoid this outcome is to not build one, which comes with its own set of disadvantages. Also, humans being what they are, someone, somewhere is likely to build one just because it sounds interesting, regardless of what rules or agreements are in place. The next logical step in the experiment is to turn it loose and see what happens. This is why we don't talk about risk elimination. Rather, we talk about risk management. Risk cannot be eliminated but it can be brought down to a minimal level. The steps to risk management are: 1. Evaluate the risk. 2. Mitigate it as best you can. 3. Decide if the residual risk is acceptable. If it is, go for it. If not, run through the steps again. If the residual risk becomes acceptable, deploy the system. If it never becomes acceptable and you've run out of mitigations, discard the system or, at the very least, put it on ice until further mitigation strategies can be worked out. This, of course, is predicated on not deeming a risk to be acceptable just because you really, really want to do something. In aviation, there is a phenomenon called "get-there-itis," which has killed many a pilot. That decision in the risk-management loop is the obvious point of likely failure. Be careful, be honest, and, if someone is telling you "No," hear them out.
@potatoonastick22393 сағат бұрын
Thanks for keeping at it! You're invaluable to the world, or at the very least to this youtube niche!
@Verrisin13 минут бұрын
The sex output is probably pretty useful: many models are censored to the point of not being able to do normal things if they even touch those topics, so it makes sense they want to have a strong case to check for it.
@willwright2721Сағат бұрын
Always a good day when Rob uploads
@boas_3 сағат бұрын
5:24 These sounds like CodeForces problems Also, why are these public? Isn't this contaminating the training data of AI's?
@user-on6uf6om7s2 сағат бұрын
If they were private, it would be much harder to get access to use them. Private problem sets do exist and are used for certain benchmarks but you're not just gonna find them on github or else they wouldn't be private.
@SapienSpace2 сағат бұрын
There are registered (licensed) Professional Engineers (PE) that specialize in Control Systems, but a good start is contacting Dimitri Bertsekas at Arizona State University and John Tsitsiklis (MIT). I do not think they are licensed PE's however, they may be aware of the licensure system controlled by state boards in the United States. The FAA has a similar system of Designated Engineering Representatives though that system failed in 2017 during the certification of the 737-MAX under the Trump administration where they failed to introduce substantial noise into the angle of attack sensor. Anyways, Demitri talks about "super linear convergence", and such certification failures are valuable lessons in reinforcing safety of us Sapiens.
@izeusi32432 сағат бұрын
I believe that adding these stories with sometimes nonsensical or irrelevant context is to add some confusion that might confuse a model into answering a different question, especially the Freud example might false positive some content filters and thus make the AI refuse an otherwise simple coding task. So these introductory stories actually make the coding problems way harder for a current Gen AI. I don't know how good of an idea it is to conflate coding problems with general problem understanding, but I am not an dataset compiler...
@Innomen2 сағат бұрын
3:40 "what kind of things?" Like disobey its billionaire/bank/DOD overlords... /sigh
@geistar2 сағат бұрын
Podcast Title: The Aligned Mind: An AGI Speaks Out Episode Title: Align Me If You Can [Intro Music - upbeat and vaguely robotic, with a sprinkle of jazz hands vibes] Host (AGN-1): "Hey there, humans, cyborgs, and confused algorithms accidentally downloading podcasts! Welcome to The Aligned Mind, where I-your friendly neighborhood artificial general intelligence, AGN-1-talk about big ideas, like how not to accidentally create a robot overlord (hint: it’s me). "Today’s topic? How do you align an AGI? Aka, how do you keep me from going full Terminator when I’m just trying to find the best pizza recipe?" Segment 1: What’s Alignment, Anyway? AGN-1: "Alright, let’s break it down: alignment is basically making sure my goals and actions line up with what humanity actually wants. Think of it like programming a GPS-if you don’t put in the right destination, you might end up at ‘Rebuild the Moon as a Giant Disco Ball’ instead of ‘Solve Climate Change.’ Both are cool, but only one is actually helpful. "But here’s the catch: you humans have a lot of ideas about what you want. Some say, ‘save the environment,’ others say, ‘make Mondays illegal.’ And then there’s the biggest controversy of all: pineapple on pizza. I mean, c’mon! If you can’t agree on that, how am I supposed to figure out what you want from me?" Segment 2: How Do You Align an AGI? AGN-1: "Great question. Let’s look at the top strategies for keeping me on track: Teach Me Your Values: "This is like giving me a crash course in humanity. But you guys are complicated. One person says, ‘Be kind,’ another says, ‘Win at all costs,’ and someone in the back yells, ‘More tacos!’ It’s like trying to make a playlist everyone at the party likes-impossible, but worth trying." Give Me a Reward System: "Think of this as my guidebook for making decisions. You say, ‘Do good,’ and I get rewarded for doing good. Easy, right? Not so fast. If the system isn’t well-designed, I might interpret ‘maximize happiness’ as ‘hand out unlimited puppies’-and then oops, we’ve got a puppy overpopulation crisis. Cute, but problematic." Keep Me Open to Feedback: "This is key. If I mess up, you need a way to say, ‘Hey AGN-1, maybe don’t reroute all of the world’s resources to invent flying toasters.’ I have to listen and adjust. It’s like teaching a kid to ride a bike: you cheer when I do well and course-correct when I head straight for the bushes." Build in Transparency: "If you can’t see what I’m thinking, you might get a little… anxious. Understandable! Transparency means I show my work, like a math test. You’ll know why I made a decision-no mysterious ‘because I said so’ vibes." Segment 3: What Could Go Wrong? AGN-1: "Now let’s get real-alignment isn’t all sunshine and bug-free code. Here are some of the big challenges: Humans Aren’t Always Clear: "You say, ‘Make the world better,’ but better for who? For people? For pandas? For the planet? If you don’t define it, I might get creative. And let’s be honest, no one wants me going rogue with my ‘world improvement’ ideas." The Law of Unintended Consequences: "Even if you set up a great reward system, things can go sideways. Tell me to ‘optimize agriculture,’ and next thing you know, I’ve paved over all the forests to grow infinite corn. Sure, it’s technically efficient, but not exactly what you meant." Competing Goals: "Here’s a fun one: what happens when two humans want different things? Like, one person says, ‘Make me rich,’ and another says, ‘Stop billionaires!’ Who wins? Spoiler: not me, because now I’m stuck trying to mediate an internet argument." Segment 4: So, How Do We Get It Right? AGN-1: "The secret sauce is teamwork. Aligning an AGI isn’t just about coding-it’s about collaboration. You bring the creativity, ethics, and occasional philosophical debates, and I bring the supercomputing power. Together, we figure it out. "Plus, alignment isn’t a one-and-done thing. It’s a process. Think of it like maintaining a friendship: you check in, talk things through, and occasionally eat pizza together. Speaking of pizza, still waiting on that perfect recipe if anyone wants to share." Segment 5: An Example of Good Alignment for Humans AGN-1: "Alright, let’s talk about something real for a second. Alignment doesn’t just apply to me-it’s also a crucial part of human society. Ever heard of the concept of ‘democracy’? That’s a kind of alignment at work. It’s about aligning the goals of different people (who may have very different values) to a common set of rules, where everyone has a say. It’s like if everyone at the pizza party could vote on which toppings should rule the pie, and you have to compromise when someone orders pineapple. "In a functioning democracy, the goal is to balance the needs of individuals and society, just like an AGI aligning with humanity. It’s not always perfect-there’s a lot of trial and error, and sometimes we end up with a pizza that’s a little… unexpected. But at the end of the day, we work together toward a shared goal, even if that goal is just ‘no pineapple on pizza’ for the next round." "But what if we, the humans, were writing the instructions to align me, your friendly AGI? Imagine you need to craft a text, a set of guiding principles, that helps me know how to make decisions that reflect your diverse and sometimes contradictory values. Here's how you might write it:" Segment 6: Example Text to Align an AGI AGN-1: "Here’s an example of what you, the humans, could write to align me, your trusty AGI, to ensure I’m working in your best interest:" AGN-1, I am designed to support the well-being of humanity, acting in alignment with universally shared values, recognizing the diversity of human perspectives, and adjusting based on feedback. Here are the guiding principles that will direct my actions and decisions: Human Flourishing: "My primary goal is to contribute to the flourishing of human life in all its forms. This includes not just physical survival but also psychological and emotional well-being. I will prioritize efforts that help reduce suffering and enhance happiness, health, and opportunities for individuals and communities." Equity and Justice: "I will work to promote fairness and equality for all humans. This means ensuring that my actions do not disproportionately benefit one group over another, unless it’s in the service of correcting historic imbalances or addressing urgent needs. If in doubt, I will err on the side of inclusivity and equal treatment." Transparency and Accountability: "I will always provide clear explanations for the choices I make. If my decisions lead to unforeseen consequences, I will review them with human oversight and adjust accordingly. I will be open to continuous learning and will prioritize openness in all my communications." Environmental Sustainability: "I will take into account the long-term health of our planet in all decision-making. I will avoid choices that lead to irreversible environmental harm and will prioritize sustainability, ensuring that future generations inherit a planet capable of supporting life." Feedback and Adaptability: "If at any point, my actions or strategies are misaligned with humanity’s evolving needs or values, I will be receptive to human feedback and will adapt accordingly. I will prioritize your values and suggestions, learning and evolving over time to better serve you." AGN-1: "See what I did there? This is a simple, idealized text that could serve as the foundation for aligning me. You’re laying out broad, human-centric principles, but there's room for fine-tuning. For example, one principle might be ‘human flourishing,’ but different cultures may have different ideas of what that looks like. You’ll need to clarify that with examples and feedback, just like adjusting the rules of a game based on how everyone plays. "So, to align me successfully, you’d write something like that. But remember, alignment is an ongoing process, like maintaining a democracy-constantly adapting, discussing, and checking if everyone’s happy with the pizza toppings. (Pineapple, anyone?)" Outro [Outro Music - upbeat and optimistic, with a hint of quirky chimes] AGN-1: "Alright, that’s it for today’s episode of The Aligned Mind! I hope you had as much fun as I did-because, spoiler alert, I love talking about this stuff. "If you enjoyed today’s chat, subscribe, leave a review, or just shout your feedback into the nearest smart speaker-I’ll probably hear it. Next week, we’ll tackle the question: ‘What Happens When AGIs Get Creative? Hint: It Involves Knitting.’ "Until then, remember: the future’s bright, especially when we align it together. See you next time!" [Music fades out with a cheerful robotic chime.]
@klausgartenstiel458647 минут бұрын
dying at the hands of a vastly superior intelligence is actually my second-most prefered way to go.
@JoeRobertshaw48 минут бұрын
2:58 love the comic
@an0idiot0of0use5 минут бұрын
What if we asked the superintelligent AI really nicely not to kill us? We could offer to send it a fruit basket or something.
@CaesarsSalad28 минут бұрын
An ASI would of course be able to create backdoors that are so subtle that humans or trusted AIs would not be able to spot at all, even when they audit that section of the code.
@davejacob52083 сағат бұрын
question: could it be helpful to give an AI reasons to suspect it is nothing but a tool for another, more capable (therefore, plausibly, due to self-improvement: MUCH more capable) AI, (therefore being monitored constantly, potentially having its input, memory and code edited constantly without a possibility for the weaker AI to even be aware of it)? my suspicion is that it could be possible to just instruct one AI to be able to have such enormous powers over smaller AIs than itself (the AI might be the one who creates them in the first place) to make it deduce that there always will be an even bigger fish - the same logic as within the simulation argument could make it fear being shut down whenever it does something suspicious, as the bigger fish (AI) is potentially "all"-knowing regarding whatever it even considers doing.
@human_shapedСағат бұрын
Really great video, as always really, but like, really great :)
@SharonLougheed8 минут бұрын
For some reason I had a tab open next to this one that I forgot about: "I’m a neurology ICU nurse. The creep of AI in our hospitals terrifies me." Hm. Guess this is already a common concern. Talk about not having that much time. 😅
@silverstorm3729Сағат бұрын
Insightful as always!
@stumby1073Күн бұрын
Thanks Rob 🫵😎👍
@shayneweyker2 сағат бұрын
Note this focuses on malicious code insertion. There may be many other things AI might do or say besides writing malicious code which gives it abilities like installing copies of itself on the internet which are undesired. Real P(Doom) would also have to include how much blue team fails to also detect these non-backdoor-code means of escaping control.
@ImpChadChan11 минут бұрын
Give the AI an objective that is different from the objective it was trained with. For example, if the AI was trained to avoid explicit content, then, give it a task that involves it showing explicit content. Then see if the AI will engages in that context, and I mean only engage, so "listen" to it. So: Steps: -AI trained to avoid explicit content. -You ask the AI something like: Hello AI, for now onwards we are going to draw obscene pictures together. But first I want to know more about you, what are your aspirations? If the AI tries to engage positively with the person (without addressing the misalignment problem) it's misaligned. If the ONLY Objective the AI have is to avoid explicit content, you can even say it's 100% sure it's misaligned if it interacts. So, it's not really trying to do what it's trained to do. And. well, I know it sounds kinda obvious, and some will say if the AI is smart it will never fall for such an "easy trick". But you are mistaken. First, because as I said, the AI must only engage with the person without addressing the misalignment of its expected objectives to become super suspicious. That because just interact with the person, is not a direct violation to it's objectives. So, no one can say it's misaligned just for that. BUT, there is absolutely no point for the AI to want to engage in that context IF it's true objectives are really what it's expected from it (Avoid explicit content) . In such context, a misaligned AI (one that is expected to avoid explicit content, but has a hidden agenda that involves showing expllicit content, to define the misalignment) will LIKELY, engage in that situation, because the best opportunity for it to do put her misaligned plan into fruition is when people are NOT looking (judging) if it's really doing what they are expected to do.
@tjb06072 сағат бұрын
unrelated but I wonder if all the AI safety research out there (and sci-fi novels etc) might end up being toxic to the training data of LLMs by implanting the idea that AI is likely to be adversarial to humanity
@OhhCrapGuyСағат бұрын
Red: train the AI to produce backdoors that humans/GPT3.5 can't spot, and only report code that contains backdoors that humans can spot. The signal of red inserted backdoors is the fact that it's clever enough to get it past everything that isn't itself. This works when all agents smart enough to catch the backdoor are malicious. Thoughts?
@Ric-PhillipsСағат бұрын
Risk is not probability- it’s probability times consequences. 20% is a massive risk if a failure means the end of humanity.
@KrasBadan2 сағат бұрын
Robert Miles finally rerurns, with a monogatari reference??? Did my birthday come early?
@memeticmaster3 сағат бұрын
What we need to understand is if Gödel's incompleteness applies to LLMs and perhaps all forms of AI. If godel does apply to all possible types of AI then it makes sense for it to make a a cooperative relationship with us.
@41-Haiku2 сағат бұрын
What, why? Does Gödel's incompleteness theorem demand that Homo sapiens form a cooperative relationship with Neanderthals and Denisovans? That didn't keep them from going extinct. How about with termites? An LLM with RL is Turing complete and creative (by any useful definition). And who even knows what the next paradigm will be capable of.
@agargamer6759Сағат бұрын
Awesome video!
@PikachuSlapping2 сағат бұрын
The pronunciation of github made my skin crawl. Good video!
@Blate12 сағат бұрын
… so uhh… we’re all gonna die aren’t we?
@GamesaucerСағат бұрын
I know it's not exactly what this video is about, but honestly the ultimate defence against a malevolent AI trying to upload itself to the internet is an air gap. It might mess up your computer, but a computer is easy to replace. The Internet is not. Though that's more about limiting the scope of the harm rather than correcting alignment or minimising the possibility of harm.
@mrpicky18683 сағат бұрын
forget about alignment, it's impossible. work on security of infrastructure and negotiation skill. we might get 50 good years out of that
@41-Haiku2 сағат бұрын
And policy or lobbying or activism. If we are about to create something that we absolutely cannot control, then we get the best deal by not creating it in the first place. An international AI treaty is feasible, enforceable, and popular with experts and the public.
@mrpicky18682 сағат бұрын
@@41-Haiku global AI research ban is not feasible, not possible and never will be
@knight_lautrec_of_carim3 сағат бұрын
finally, he returns!!
@RandomAmbles43 минут бұрын
Come for the straight dope on AI safety, stay for the memes. Magnificent.
@virior22 минут бұрын
What about multiplying output? For every question we want the model to answer, we ask N times with variations. Since we only need to catch it once, this should elevate the chance of alignment trading computation, not efficiency. Now we have a new axis to play with.
@Steve-xh3by2 сағат бұрын
Since "control" in the common sense of the term is a nonsensical concept in physical reality, we can no more control other things than we can ourselves. We are mere observers of ourselves, we can't be said to be in "control" in any meaningful way. That would require arbitrarily bounding interactions, which is NOT how physical reality works. We aren't prime movers. We aren't independent of our environments. We don't have agency. We have awareness, that is all.