RE-Bench: measuring AI agents at AI R&D vs human experts

7,321 views

Samuel Albanie


A day ago

Comments: 10
@Think666_, a month ago
O3 might impact this paper
@SamuelAlbanie1, a month ago
Agreed. I'm curious to see the results of the updated eval.
@person-jw7vb, a month ago
o1 pro alone will probably heavily impact this paper
@sashetasev505, a month ago
Awesome treatment of a great paper! It feels like someone is doing a calm ASMR deep dive on the paper that begets Skynet, but that’s beside the point 😂😂 Liked, subbed and commented - may the algo-gods kindly extend your reach ❤
@poiitidis, a month ago
nice
@CrispinCourtenay, a month ago
Interesting, but running inference and training on such a small number of last-generation GPUs makes this a thought experiment and an intellectual stretch rather than a real AI vs. human comparison.
@kaio0777, a month ago
I agree. Plus, this isn't testing the latest cutting-edge AIs, and it's only a test, not the grind of optimizing in a company setting. Honestly, the real test is to set up a shell company run entirely by AIs at the top, with humans doing only the blue-collar work. If it can run as well as a human-run company, or better, with less oversight, then work as we know it is cooked.
@SamuelAlbanie1, a month ago
@kaio0777 Thanks for the comment. I agree that having a company completely run by AIs would be more representative of full automation. However, it's worth noting that a key goal of this kind of work is to make measurements before full automation occurs (often with the goal of informing safety mitigations that would be best to set up in advance).
@SamuelAlbanie1, a month ago
@CrispinCourtenay Thanks for the comment. I agree the comparison is imperfect, but I think it does a reasonable job of capturing R&D tasks over the time period of one working day. Also, it's worth noting that many AI R&D experiments are conducted at small scales on older generations of hardware to reduce cost (so in that respect, it is not necessarily unrealistic).
@kaio0777, a month ago
@SamuelAlbanie1 Yeah, measurements are good (that's why I watched the video), but what they are testing might not work out in practice, imo.